Lazy Programmer

Your source for the latest in deep learning, big data, data science, and artificial intelligence. Sign up now

Probability Smoothing for Natural Language Processing

January 23, 2016

alejandroandrade-wordcloud

Level: Beginner

Topic: Natural language processing (NLP)

This is a very basic technique that can be applied to most machine learning algorithms you will come across when you’re doing NLP.

Suppose for example, you are creating a “bag of words” model, and you have just collected data from a set of documents with a very small vocabulary. Your dictionary looks like this:

{"cat": 10, "dog": 10, "parrot": 10}

You would naturally assume that the probability of seeing the word “cat” is 1/3, and similarly P(dog) = 1/3 and P(parrot) = 1/3.

Now, suppose I want to determine the probability of P(mouse). Since “mouse” does not appear in my dictionary, its count is 0, therefore P(mouse) = 0.

This is a problem!

If you wanted to do something like calculate a likelihood, you’d have $$ P(document) = P(words that are not mouse) \times P(mouse) = 0 $$

This is where smoothing enters the picture.

We simply add 1 to the numerator and the vocabulary size (V = total number of distinct words) to the denominator of our probability estimate.

$$ P(word) = \frac{word count + 1}{total number of words + V} $$

Now our probabilities will approach 0, but never actually reach 0.

For a word we haven’t seen before, the probability is simply:

$$ P(new word) = \frac{1}{N + V} $$

You can see how this accounts for sample size as well.

If our sample size is small, we will have more smoothing, because N will be smaller.

 

N-gram probability smoothing for natural language processing

An n-gram (ex. bigram, trigram) is a probability estimate of a word given past words.

For example, in recent years, \( P(scientist | data) \) has probably overtaken \( P(analyst | data) \).

In general we want to measure:

$$ P(w_i | w_{i-1}) $$

This probably looks familiar if you’ve ever studied Markov models.

You can see how such a model would be useful for, say, article spinning.

You could potentially automate writing content online by learning from a huge corpus of documents, and sampling from a Markov chain to create new documents.

Disclaimer: you will get garbage results, many have tried and failed, and Google already knows how to catch you doing it. It will take much more ingenuity to solve this problem.

The maximum likelihood estimate for the above conditional probability is:

$$ P(w_i | w_{i-1}) = \frac{count(w_i | w_{i-1})}{count(w_{i-1})} $$

You can see that as we increase the complexity of our model, say, to trigrams instead of bigrams, we would need more data in order to estimate these probabilities accurately.

$$ P(w_i | w_{i-1}, w_{i-2}) = \frac{count(w_i | w_{i-1}, w_{i-2})}{count(w_{i-1}, w_{i-2})} $$

So what do we do?

You could use the simple “add-1” method above (also called Laplace Smoothing), or you can use linear interpolation.

What does this mean? It means we simply make the probability a linear combination of the maximum likelihood estimates of itself and lower order probabilities.

It’s easier to see in math…

$$ P(w_i | w_{i-1}, w_{i-2}) = \lambda_3 P_{ML}(w_i | w_{i-1}, w_{i-2}) + \lambda_2 P_{ML}(w_i | w_{i-1}) + \lambda_1 P_{ML}(w_i) $$

We treat the lambda’s like probabilities, so we have the constraints \( \lambda_i \geq 0 \) and \( \sum_i \lambda_i = 1 \).

The question now is, how do we learn the values of lambda?

One method is “held-out estimation” (same thing you’d do to choose hyperparameters for a neural network). You take a part of your training set, and choose values for lambda that maximize the objective (or minimize the error) of that training set.

If you have ever studied linear programming, you can see how it would be related to solving the above problem.

Another method might be to base it on the counts. This would work similarly to the “add-1” method described above. If we have a higher count for \( P_{ML}(w_i | w_{i-1}, w_{i-2}) \), we would want to use that instead of \( P_{ML}(w_i) \). If we have a lower count we know we have to depend on\( P_{ML}(w_i) \).

Good-Turing smoothing and Kneser-Ney smoothing

These are more complicated topics that we won’t cover here, but may be covered in the future if the opportunity arises.

Have you had success with probability smoothing in NLP? Let me know in the comments below!