FREE Exercise: Predict Stocks with News, + Other ML News

January 19, 2022

TL;DR: this is an article about how to predict stocks using the news.

In this article, we are going to do an exercise involving my 2 current favorite subjects: natural language processing and financial engineering!

I’ll present this as an exercise / tutorial, so hopefully you can follow along on your own.

One comment I frequently make about predicting stocks is that autoregressive time series models aren’t really a great idea.

Basic analysis (e.g. ACF, PACF) shows no serial correlation in returns (that is, there’s no correlation between past and future) and hence, the future is not predictable from the past.

The best-fitting ARIMA model is more often than not, a simple random walk.

What is a random walk? If you haven’t yet learned this from me, then basically think of it like flipping a coin at each time step. The result of the coin flip tells you which way to walk: up the street or down the street.

Just as you can’t predict the result of a coin flip from past coin flips (by the way, this is essentially the gambler’s fallacy!), so too is it impossible to predict the next step of a random walk.

In these situations, the best prediction is simply the last-known value.

This is why, when one tries to fit an LSTM to a stock price time series, all it ends up doing is predicting close to the previous value.

There is a nice quote which is unfortunately (as far as I know) unattributed, that says something like: “trying to predict the future from the past is like trying to drive by looking through the rearview mirror”.

Anyway, this brings us to the question: “If I don’t use past prices, then what do I use?”

One common approach is to use the news.

We’ve all seen that news and notable events can have an impact on stock / cryptocurrency prices. Examples:

  • The Omicron variant of COVID-19
  • High inflation
  • Supply-chain issues
  • Elon Musk tweeting about Dogecoin
  • Mark Zuckerberg being grilled by the government

Luckily, I’m not going to make you scrape the web to download news yourself.

Instead, we’re going to use a pre-built dataset, which you can get at:

Briefly, you’ll want to look at the “combined” CSV file which has the following columns:

  • Date (e.g. 2008-08-11 – daily data)
  • Label (0 or 1 – whether or not the DJIA went up or down)
  • Top1, Top2, …, Top25 (news in the form of text, retrieved from the top 25 Reddit news posts)

Note that this is a binary classification problem.

Thanks to my famous rule, “all data is the same“, your code should be no different than a simple sentiment analysis / spam detection script.

To start you off, I’ll present some basic starter code / tips.


Tip 1) Some text contains weird formatting, e.g.

b”Georgia ‘downs two Russian warplanes’ as cou…

Basically, it looks like how a binary string would be printed out, but the “b” is part of the actual string.

Here’s a simple way to remove unwanted characters:


Tip 2) Don’t forget that this is time-ordered data, so you don’t want to do a train-test split with shuffling (mixing future and past in the train and test sets). The train set should only contain data that comes before the test set.


Tip 3) A simple way to form feature vectors from the news would be to just concatenate all 25 news columns into a single text, and then apply TF-IDF. E.g.

I’ll leave the concatenation part as an exercise for you.


Here are some extra thoughts to consider:

  • How were the labels created? Does that method make sense? Is it based on close-close or open-close?
  • What were the exact times that the news was posted? Was there sufficient time between the latest news post and the result from which the label is computed?
  • Returns tend to be very noisy. If you’re getting something like 85% test accuracy, you should be very suspicious that you’ve done something wrong. A more realistic result would be around 50-60%. Even 60% would be considered suspiciously high.


So that’s basically the exercise. It is simple, yet hopefully thought-provoking.


Now I didn’t know where else to put this ML news I found recently, but I enjoyed it so I want to share it with you all.

First up: “Chatbots: Still Dumb After All These Years

I enjoyed this article because I get a lot of requests to cover Chatbots.

Unfortunately, Chatbot technology isn’t very good.

Previously, we used seq2seq (and also seq2seq with attention) which basically just learns to copy canned responses to various inputs. seq2seq means “sequence to sequence” so the input is a sequence (a prompt) and the target/output is a sequence (the chatbot’s response).

Even with Transformers, the best results are still lacking.


Next: “PyTorch vs TensorFlow in 2022

Wait, people are still talking about this in 2022? You betcha!

Read this article. It says a lot of the same stuff I’ve been saying myself. But it’s nice to hear it from someone else.

It also provides actual metrics which I am too lazy to do.


Finally: “Facebook’s advice to students interested in artificial intelligence

This isn’t really “new news” (in fact, Facebook isn’t even called Facebook anymore) but I recently came across this old article I saved many years earlier.

Probably the most common beginner question I get is “why do I need to do all this math?” (in my ML courses).

You’ve heard the arguments from me hundreds of times.

Perhaps you are hesitant to listen to me. That would be like listening to your parents. Yuck.

Instead, why not listen to Yann LeCun? Remember that guy? The guy who invented CNNs?

He’s the Chief AI Scientist at Facebook (Meta) now, so if you want a job there, you should probably listen to his advice…

And if you think Google, Netflix, Amazon, Microsoft, etc. are any different, well, that is wishful thinking my friends.

What do you think?

Is this convincing? Or is Yann LeCun just as wrong as I am?

Let me know!

Go to comments

Convert a Time Series Into an Image with Gramian Angular Fields and Markov Transition Fields

August 30, 2021

In my latest course (Time Series Analysis), I made subtle hints in the section on Convolutional Neural Networks that instead of using 1-D convolutions on 1-D time series, it is possible to convert a time series into an image and use 2-D convolutions instead.

CNNs with 2-D convolutions are the “typical” kind of neural network used in deep learning, which normally are used on images (e.g. ImageNet, object detection, segmentation, medical imaging and diagnosis, etc.)

In this article, we will look at 2 ways to convert a time series into an image:

  1. Gramian Angular Field
  2. Markov Transition Field



Gramian Angular Field


The Gramian Angular Field is quite involved mathematically, so this article will discuss the intuition only, along with the code.

Those interesting in all the gory details are encouraged to read the paper, titled “Encoding Time Series as Images for Visual Inspection and Classification Using Tiled Convolutional Neural Networks” by Zhiguang Wang and Tim Oates.

We’ll build the intuition in a series of steps.

Let us begin by recalling that the dot product or inner product is a measure of similarity between two vectors.

$$\langle a, b\rangle = \lVert a \rVert \lVert b \rVert \cos \theta$$

Where \( \theta \) is the angle between \( a \) and \( b \).

Ignoring the magnitude of the vectors, if the angle between them is small (i.e. close to 0) then the cosine of that angle will be nearly 1. If the angle is perpendicular, the cosine of the angle is 0. If the two vectors are pointing in opposite directions, then the cosine of the angle will be -1.

The Gram Matrix is just the repeated application of the inner product between every vector in a set of vectors, and every other vector in that same set of vectors.

i.e. Suppose that we store a set of column vectors in a matrix called \( X \).

The Gram Matrix is:

$$ G = X^TX $$

This expands to:

$$G = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & … & \langle x_1, x_N \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & … & \langle x_2, x_N \rangle \\ … & … & … & … \\ \langle x_N, x_1 \rangle & \langle x_N, x_2 \rangle & … & \langle x_N, x_N \rangle \end{bmatrix} $$

In other words, if we think of the inner product as the similarity between two vectors, then the Gram Matrix just gives us the pairwise similarity between every vector and every other vector.


Note that the Gramian Angular Field (GAF) does not apply the Gram Matrix directly (in fact, each value of the time series is a scalar, not a vector).

The first step in computing the GAF is to normalize the time series to be in the range [-1, +1].

Let’s assume we are given a time series \( X = \{x_1, x_2, …, x_N \} \).

The normalized values are denoted by \( \tilde{x_i} \).

The second step is to convert each value in the normalized time series into polar coordinates.

We use the following transformation:

$$ \phi_i = \arccos \tilde{x_i}$$

$$ r_i = \frac{t_i}{N} $$

Where \( t_i \in \mathbb{N} \) represents the timestamp of data point \(x _i \).

Finally, the GAF method defines its own “special” inner product as:

$$ \langle x_1, x_2 \rangle = \cos(\phi_1 + \phi_2) $$

From here, the above formula for \( G \) still applies (except using \( \tilde{X} \) instead of \( X \), and using the custom inner product instead of the usual version).

Here is an illustration of the process:

So why use the GAF?

Like the original Gram Matrix, it gives you a “picture” (no pun intended) of the relationship between every point and every other point in the time series.

That is, it displays the temporal correlation structure in the time series.

Here’s how you can use it in code.

Firstly, you need to install the pyts library. Then, run the following code on a time series of your choice:


Note that the library allows you to rescale the image with the image_size argument.

As an exercise, try using this method instead of the 1-D CNNs we used in the course and compare their performance!


Markov Transition Field

The Markov Transition Field (MTF) is another method of converting a time series into an image.

The process is a bit simpler than that of the GAF.

If you have taken any of my courses which involve Markov Models (like Natural Language Processing, or HMMs) you should feel right at home.

Let’s assume we have an N-length time series.

We begin by putting each value in the time series into quantiles (i.e. we “bin” each value).

For example, if we use quartiles (4 bins), the smallest 25% of values would define the boundaries of the first quartile, the second smallest 25% of values would define the boundaries of the second quartile, etc.

We can think of each bin as a ‘state’ (using Markov model terminology).

Intuitively, we know that what we’d like to do when using Markov models is to form the state transition matrix.

This matrix has the values:

$$A_{ij} = P(s_t = j | s_{t-1} = i)$$

That is, \( A_{ij} \) is the probability of transitioning from state i to state j.

As usual, we estimate this value by maximum likelihood. ( \( A_{ij} \) is the count of transitions from i to j, divided by the total number of times we were in state i).

Note that if we have \( Q \) quantiles (i.e. we have \( Q \) “states”), then \( A \) is a \( Q \times Q \) matrix.

The MTF follows a similar concept.

The MTF (denoted by \( M \)) is an \( N \times N \) matrix where:

$$M_{kl} = A_{q_k q_l}$$

And where \( q_k \) is the quantile (“bin”) for \( x_k \), and \( q_l \) is the quantile for \( x_l \).

Note: I haven’t re-used the letters i and j to index \( M \), which most resources do and it’s super confusing.

Do not mix up the indices for \( M \) and \( A \)! The indices in \( A \) refer to states. The indices for \( M \) are temporal.

\( A_{ij} \) is the probability of transitioning from state i to state j.

\( M_{kl} \) is the probability of a one-step transition from the bin for \( x_k \), to the bin for \( x_l \).

That is, it looks at \( x_k \) and \( x_l \), which are 2 points in the time series at arbitrary time steps \( k \) and \( l \).

\( q_k \) and \( q_l \) are the corresponding quantiles.

\( M_{kl} \) is then just the probability that we saw a direct one-step (i.e. Markovian) transition from \( q_k \) to \( q_l \) in the time series.

So why use the MTF?

It shows us how related 2 arbitrary points in the time series are, relative to how often they appear next to each other in the time series.


Here’s how you can use it in code.

Note that the library allows you to rescale the image with the image_size argument.

As an exercise, try using this method instead of the 1-D CNNs we used in the course and compare their performance


Go to comments

Should you study the theory behind machine learning?

August 23, 2021

In this post, I want to discuss why you should not study the theory behind machine learning.

This may surprise some of you, since my courses can appear to be more “theoretical” than other ML courses on popular websites such as Udemy.

However, that is not the kind of “theory” I am talking about.


Most popular courses in ML don’t look at any math at all.

They are popular precisely for this reason: lack of math makes them accessible to the average Joe.

This does a disservice to you students, because you end up not having any solid understanding about how the algorithm works.

You may end up:

  • doing things that don’t make sense, due to that lack of understanding.
  • only being able to copy code from others, but not write any code yourself.
  • not knowing how to apply algorithms to new kinds of data, without someone showing you how first.

For more discussion on that, see my post: “Why do you need math for machine learning and deep learning?

But let’s make this clear: math != theory.


When we look at math in my courses, we only look at the math needed to derive the algorithm and understand how it works at an intuitive level.

Yes, believe it or not, we are using math to improve our intuition.

This is despite what many beginners might think. When they see math, they automatically assume “math” = “not intuitive”, and that “intuitive” = “pictures, animations, and purposely avoiding math”.

That’s OK if you want to read a news article in the NY Times about ML, but not when you want to be a practitioner of ML.

Those are 2 different levels of “intuition” (layman vs. practitioner).


To see an extreme example of this, one need not look any further than Albert Einstein. Einstein was great at communicating his ideas to the public. Everyone can easily understand the layman interpretation of general relativity (mass bends space and time). But this is not the same as being a practitioner of relativistic physics.

Everyone has seen this picture and understands what it means at a high level. But does that mean you are a physicist or that you can “do physics”?

Anyway, that was just an aside so we don’t confuse “math used for intuition” and “layman intuition” and “theory”. These are 3 separate things. Just because you’re looking at some math, does not automatically imply you’re looking at “theory”.



What do we mean by “theory”?

Here’s a simple question to consider. Why does gradient descent work?

Despite the fact that we have used gradient descent in many of my courses, and derived the gradient descent update rules for neural networks, SVMs, and other models, we have never discussed why it works.

And that’s OK!

The “mathematical intuition” is enough.

But let’s get back to the question of this article: Why is the Lazy Programmer saying we should not study theory?


Well, this is the kind of “theory” that gets so deep, it:

  • Does not produce any near-term gains in your work
  • Requires a very high level of math ability (e.g. real analysis, optimization, dynamical systems)
  • Is on the cutting-edge of understanding, and thus very difficult, likely to be disputed or even superseded in the near future


Case in point: although we have been using gradient descent for years in my courses (and decades before that in general), our understanding is still not yet complete.

Here’s an article that just came out this year on gradient descent (August 2021): “Computer Scientists Discover Limits of Major Research Algorithm“.

Here’s a direct link to the corresponding paper, called “The Complexity of Gradient Descent: CLS = PPAD ∩ PLS”:

There will be more papers on these “theory” topics in the years to come.


My advice is not to go down this path, unless you really enjoy it, you are doing graduate research (e.g. PhD-level), you don’t mind if ideas you spent years and years working on might be proven incorrect, and you have a very high level of math ability in subjects like real analysis, optimization, and dynamical systems.

Go to comments

Predicting Stock Prices with Facebook Prophet

August 3, 2021

Prophet is Facebook’s library for time series forecasting. It is mainly geared towards business datasets (e.g. predicting adspend or CPU usage), but a natural question that comes up with my students whenever we talk about time series is: “can it predict stock prices?”

In this article, I will discuss how to use FB Prophet to predict stock prices, and I’ll also show you what not to do (things I’ve seen in other popular blogs). Furthermore, we will benchmark the Prophet model with the naive forecast, to check whether or not one would really want to use this.

Note: This is an excerpt from my full VIP course, “Time Series Analysis, Forecasting, and Machine Learning“. If you want the code for this example, along with many, many other code examples on stock prices, sales data, and smartphone data, get the course!

The Prophet section will be part of the VIP version only, so get it now while the VIP coupon is still active!


How does Prophet work?

The Prophet model is a 3 component, non-autoregressive time series model. Specifically:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon(t)$$


The Prophet model is not autoregressive, like ARIMA, exponential smoothing, and the other methods we study in a typical time series course (including my own).

The 3 components are:

1) The trend \( g(t) \) which can be either linear or logistic.

2) The seasonality \( s(t) \), modeled using a Fourier series.

3) The holiday component \( h(t) \), which is essentially a one-hot vector “dotted” with a vector of weights, each representing the contribution from their respective holiday.


How to use Prophet for predicting stock prices

In my course, we do 3 experiments. Our data is Google’s stock price from approximately 2013-2018, but we only use the first 2 years as training data.

The first experiment is “plug-and-play” into Prophet with the default settings.


Here are the results:

Unfortunately, Prophet mistakenly believes there is a weekly seasonal component, which is the reason for the little “hairs” in the forecast.

When we plot the components of the model, we see that Prophet has somehow managed to find some weekly seasonality.

Of course, this is completely wrong! The model believes that the stock price increases on the weekends, which is highly unlikely because we don’t have any data for the weekend.


The second experiment is an example of what not to do. I saw this in every other popular blog, which is yet another “data point” that should convince you not to trust these popular data science blogs you find online (except for mine, obviously).

In this experiment, we set daily_seasonality to True in the model constructor.


Here are the results.

It seems like those weird little “hairs” coming from the weekly seasonal component have disappeared.

“The Lazy Programmer is wrong!” you may proclaim.

However, this is because you may not understand what daily seasonality really means.

Let’s see what happens when we plot the components.

This plot should make you very suspicious. Pay attention to the final chart.

“Daily seasonality” pertains to a pattern that repeats everyday with sub-daily changes.

This cannot be the case, because our data only has daily granularity!

Lesson: don’t listen to those “popular” blogs.


For experiment 3, we set weekly seasonality to False. Alternatively, you could try playing around with the priors.


Here are the results.

Notice that the “little hairs” are again not present.


Is this model actually good?

Just because you can make a nice chart, does not mean you have done anything useful.

In fact, you see the exact same mistakes in those blog articles and terrible Udemy courses promising to “predict stock prices with LSTMs” (which I will call out every chance I get).

One of the major mistakes I see in nearly every blog post about predicting stock prices is that they don’t bother to compare it to a benchmark. And as you’ll see, the benchmark for stock prices is quite a low bar – there is no reason not to compare.

Your model is only useful if it can beat the benchmark.

For stock price predictions, the benchmark is typically the naive forecast, which is the optimal forecast for a random walk.

Random walks are often used as a model for stock prices since they share some common attributes.

For those unfamiliar, the naive forecast is simply where you predict the last-known value.

Example: If today’s price on July 5 is $200 and I want to make a forecast with a 5-day horizon, then I will predict $200 for July 6, $200 for July 7, …, and $200 for July 10.

I won’t bore you with the code (although it’s included in the course if you’re interested), but the answer is: Prophet does not beat the naive forecast.

In fact, it does not beat the naive forecast on any horizon I tried (5 days, 30 days, 60 days).

Sidenote: it’d be a good exercise to try 1 day as well.


How to learn more

Are stock prices really random walks? Although this particular example provides evidence supporting the random walk hypothesis, in my course, the GARCH section will provide strong evidence against it! Again, it’s all explained in my latest course, “Time Series Analysis, Forecasting, and Machine Learning“. Only the VIP version will contain the sections on Prophet, GARCH, and other important tools.

The VIP version is intended to be limited-time only, and the current coupon expires in less than one month!

Get your copy today while you still can.

Go to comments

Why do you need math for machine learning and deep learning?

July 9, 2021

In this article, I will demonstrate why math is necessary for machine learning, data science, deep learning, and AI.

Most of my students have already heard this from me countless times. College-level math is a prerequisite for nearly all of my courses already.

This article is a bit different.

Perhaps you may believe I am biased, because I’m the one teaching these courses which require all this math.

It would seem that I am just some crazy guy, making things extra hard for you because I like making things difficult.


You’ve heard it from me many times. Now you’ll hear it from others.

This article is a collection of resources where people other than myself explain the importance of math in ML.


Example #1

Let’s begin with one of the most famous professors in ML, Daphne Koller, who co-founded Coursera.

In this clip, Lex Fridman asks what advice she would have for those interested in beginning a journey into AI and machine learning.

One important thing she mentions, which I have seen time and time again in my own experience, is that those without typical prerequisite math backgrounds often make mistakes and do things that don’t make sense.

She’s being nice here, but I’ve met many of these folks who not only have no idea that what they are doing does not make sense, they also tend to be overly confident about it!

Then it becomes a burden for me, because I have to put in more effort explaining the basics to you just to convince you that you are wrong.

For that reason, I generally advise against hiring people for ML roles if they do not know basic math.


Example #2

I enjoyed this strongly worded Reddit comment.

Original post:

Top comment:


Example #3

Not exactly machine learning, but very related field: quant finance.

In fact, many students taking my courses dream about applying ML to finance.

Well, it’s going to be pretty hard if you can’t pass these interview questions.

Think about this logically: All quants who have a job can pass these kinds of interview questions. But you cannot. How well do you think you will do compared to them?


Example #4

Entrepreneur and angel investor Naval Ravikant explains why deriving (what we do in all of my in-depth machine learning courses) is much more important than memorizing on the Joe Rogan Experience.

Most beginner-level Udemy courses don’t derive anything – they just tell you random facts about ML algorithms and then jump straight to the usual 3 lines of scikit-learn code. Useless!

Link: (Skips to 1:33:30 automatically)



Example #5

I found this in a thread about Lambda School (one of the many “developer bootcamps” in existence these days) getting sued for lying about its job placement rates and cutting down on its staff.

Two interesting comments here from people “in the know” about how bootcamps did not really help unless the student already had a math / science / STEM background. The first comment is striking because it is written by a former recruiter (who has the ability to see who does and doesn’t get the job).

That is to say, it is difficult to go from random guy off the street to professional software engineer from just a bootcamp alone (the implication here is that we can apply similar reasoning to online courses).

In this case, it wasn’t even that the math was being directly applied. A math / science background is important because it teaches you how to think properly. If 2 people can complete a bootcamp or online course, but only one has a STEM background and knows how to apply what they learned, that one will get the job, and the other will not.

Importantly, note that it’s not about the credentials, it’s purely about ability, as per the comments below.



Example #6

This is from a thread concerning Yann LeCun’s deep learning course at NYU. As usual, someone makes a comment that you don’t need such courses when you can just plug your data into Tensorflow like everyone else. Another, more experienced developer sets them straight.



Example #7

Hey, you guys have heard of Yann LeCun, right? Remember that guy? The guy who invented CNNs?

Let’s see what he has to say:

Math. Math. Oh and perhaps some more math.

That’s the gist of the advice to students interested in AI from Facebook’s Yann LeCun and Joaquin Quiñonero Candela

 who run the company’s Artificial Intelligence Lab and Applied Machine Learning group respectively.

Tech companies often advocate STEM (science, technology, engineering and math), but today’s tips are particularly pointed. The pair specifically note that students should eat their vegetables take Calc I, Calc II, Calc III, Linear Algebra, Probability and Statistics as early as possible.



This article will be updated over time. Keep checking back!

Go to comments

Time Series: How to convert AR(p) to VAR(1) and VAR(p) to VAR(1)

July 1, 2021

This is a very condensed post, mainly just so I could write down the equations I need for my Time Series Analysis course. 😉

However, it you find it useful – I am happy to hear that!

[Get 75% off the VIP version here]

Start with an AR(2):

$$ y_t = b + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t $$


Suppose we create a vector containing both \( y_t \) and \( y_{t -1} \):

$$\begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix}$$


We can write our AR(2) as follows:

$$\begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \end{bmatrix}$$


Exercise: expand the above to see that you get back the original AR(2). Note that the 2nd line just ends up giving you \( y_{t-1} = y_{t-1} \).

The above is just a VAR(1)!

You can see this by letting:

$$ \textbf{z}_t = \begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix}$$

$$ \textbf{b}’ = \begin{bmatrix} b \\ 0 \end{bmatrix} $$

$$ \boldsymbol{\Phi}’_1 = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} $$

$$ \boldsymbol{\eta}_t = \begin{bmatrix} \varepsilon_t \\ 0 \end{bmatrix}$$.

Then we get:

$$ \textbf{z}_t = \textbf{b}’ + \boldsymbol{\Phi}’_1\textbf{z}_{t-1} + \boldsymbol{\eta}_t$$

Which is a VAR(1).


Now let us try to do the same thing with an AR(3).

$$ y_t = b + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \varepsilon_t $$


We can write our AR(3) as follows:

$$\begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \end{bmatrix} = \begin{bmatrix} b \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \\ y_{t-3} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \\ 0 \end{bmatrix}$$

Note that this is also a VAR(1).


Of course, we can just repeat the same pattern for AR(p).


The cool thing is, we can extend this to VAR(p) as well, to show that any VAR(p) can be expressed as a VAR(1).

Suppose we have a VAR(3).

$$ \textbf{y}_t = \textbf{b} + \boldsymbol{\Phi}_1 \textbf{y}_{t-1} + \boldsymbol{\Phi}_2 \textbf{y}_{t-2} + \boldsymbol{\Phi}_3 \textbf{y}_{t-3} + \boldsymbol{ \varepsilon }_t $$


Now suppose that we create a new vector by concatenating \( \textbf{y}_t \), \( \textbf{y}_{t-1} \), and \( \textbf{y}_{t-2} \). We get:

$$\begin{bmatrix} \textbf{y}_t \\ \textbf{y}_{t-1} \\ \textbf{y}_{t-2} \end{bmatrix} = \begin{bmatrix} \textbf{b} \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \boldsymbol{\Phi}_1 & \boldsymbol{\Phi}_2 & \boldsymbol{\Phi}_3 \\ I & 0 & 0 \\ 0 & I & 0 \end{bmatrix} \begin{bmatrix} \textbf{y}_{t-1} \\ \textbf{y}_{t-2} \\ \textbf{y}_{t-3} \end{bmatrix} + \begin{bmatrix} \boldsymbol{\varepsilon_t} \\ 0 \\ 0 \end{bmatrix}$$

This is a VAR(1)!



Go to comments

What’s the difference between your courses and your consulting service?

May 7, 2021

This article is part of a series on common beginner questions.

It is related to another article on how to use the Q&A, but this covers the specific area of courses vs. consulting.

What is the issue?

Sometimes, students ask questions such as:

  • Can you provide feedback on my code?
  • Can you help me apply this algorithm on my dataset?
  • Can you teach me about this other algorithm which wasn’t covered in the course?
  • I have a task for XYZ. Can you advise me?
  • I need to write a function to do XYZ. Can you help me?
  • I got this question from a job interview / test / homework. Can you help?

If you ask these inside the Q&A for a course, the answer for all these questions is “no”.

Why? Because these are the types of questions I answer for my 1-on-1 / consulting clients.

Specifically, people hire me for 1-on-1s because:

  • The courses are very general, but they need help with more specific issues related to their personal work
  • They need help with their coding
  • They want to apply what they learned in the course to some outside dataset
  • They want to apply what they learned in the course but with tweaks to some algorithms
  • They want personal attention and tutelage to tackle their personal challenges and pain points
  • They have a specific task they need to complete and they need help
  • They need help with their job
  • They need personal career advice
  • They need to write a function but got stuck
  • They don’t have the prerequisites for a specific course and want to catch up fast

This is different from a course. In a course:

  • We teach a pre-defined list of topics, covering pre-defined areas (theory, followed by code)
  • We do not cater to any specific student
  • We will not create new lectures or cover new topics at a single student’s request
  • The course is made for a general audience, not a specific person
  • The prerequisites are set to be appropriate for the topic. For instance, if you want to learn ML, we are not going to teach basic math and programming. If you want to learn that, you have to do that by yourself before starting the course.


Unfairness due to cost

The main point is cost.

Courses can be provided at much lower cost (typically < $100) since they are scalable. The same course can given to many students around the world.

Personalized assistance and consulting costs much more (> $250 / hour) since it is not scalable. The assistance I provide for you only works for that singular instance. We are trading time for money.

Sidenote: the exact rate varies (generally increasing over time with inflation and my level of experience) so I have not posted an exact rate here.

Basically, it is unfair to provide the consulting service (which is much more costly) for the price of a course.


2 reasons:

1) It’s unfair to me, because you are not paying my stated rate. If you choose not to hire me, that’s fine. It would be like walking into a restaurant and asking for a 90% discount. I simply say “no thanks” and take the next customer.

2) It’s unfair to my clients, because they paid the fair price. It’s not fair to ask for the same service but pay a fraction of the cost that everyone else pays.


What does it mean to buy a course?

When you buy a course, you get the course.

The course consists of lecture videos, corresponding code samples, and perhaps other documents.

That’s generally what an online course (a “MOOC”) is.

It’s not a service, but a product.

As a general rule of thumb:

When you sign up for the course, any assistance I provide pertains to the actual course (i.e. the code, data, and algorithms contained within it) and nothing else.

For example, you wouldn’t pay for dinner at a restaurant and then ask the waiter to help you with your taxes. The waiter’s job is confined to serving you at the restaurant.

Similarly, when you sign up for my course, my job is confined to helping you with the course.

Sidenote: even this goes far beyond a typical MOOC. Most MOOCs simply offer no help or forum at all (or perhaps, you’re left to discuss your issue with other students, who may not know the correct solution).

You might think it’s a “grey area” if you ask me a question such as:

“Can you give me feedback on my implementation of logistic regression?”

Which might seem fair, especially if the course is about logistic regression.

But remember, I’ve already provided an implementation.

If you want to compare your implementation to the “official” one, that’s great! Learn from it.

Learn from the tips and tricks provided in the lectures.

Unfortunately, I can’t provide individualized assistance / personalized coaching for each student.

As much as I’d love to, I can’t check your work for you.

Remember, there are sometimes tens of thousands of students in a course.

It would be unfair to ask for personalized assistance, since that would mean all tens of thousands of students should be entitled to the same treatment.

The key is that for courses, assistance is not personalized, but consulting is personalized.


Help with your job / work

Additionally, it’s really unfair, and also unethical for you to ask for assistance with your work.

First of all, let’s assume your boss is paying you $100/h.

If you only paid $10 (flat fee) for the course, is it fair that I help you with your job while not being paid the same rate as you?

Why don’t I just go to your boss so they can pay me instead, and do a much better job than you?

Furthermore, it’s dishonest.

Why should your boss keep you on the team if you can’t perform your job on your own? What will happen on the next job?

You are lying to your teammates and you are lying to your boss about your true skill level.

Additionally, you would be taking credit for my work, which I obviously cannot allow.


Help with your projects or business

Another good example is webapps. Suppose that I helped you build an app, which you then turned into a business and made millions of dollars from. Is that fair to me?

Of course it is not. If you want me to assist in building your business idea, you’ll have to sign a contract and compensate me for my work.

You can’t simply say, “but you told us to build our own webapps as an exercise in the course” as an excuse.

Further, even if you promise me “I’m not using this for my business” there’s no way for me to verify that this is true.


Conflicts of interest and legal consequences

Another important reason why I can’t assist you with out-of-course content is because I don’t know what you’re using it for.

For example, a student previously asked for help on building a model to discriminate between people using criteria that is illegal in some places.

There’s no way for me to verify that your use-case does not violate any laws or existing contracts I have (as much as you promise me that it’s okay).


Teaching assistants have been trained on course content only

Another important reason I (or others) cannot assist you with out-of-course content is because the team I’ve hired to oversee the Q&A forums have only been trained on the course content.

They specialize in the course content, but may not have expertise in your particular question if it isn’t directly concerning the content of the course.


Summary Table

See the table below for a summary of key points:

Course Consulting
Cost Typically $10-$100, one time fee > $250 / h (hourly fee)
Scalability Scalable (1-to-many) Not scalable (1-to-1)
Topics Predefined (whatever I decided the course is about) Anything you want
What you can ask on the Q&A Anything about the course (existing code, data, algorithms) Anything you want
Feedback on your code No Yes
Help with your dataset No Yes
Help with your homework No Yes
Help with your job / work task No Yes
Ask about similar algorithms not in the course No Yes
Personalized No Yes



More questions: You encourage us to “code by ourselves”, so why can’t you give feedback?

The mistake is mixing up 2 separate things:

1) You coding yourself

2) Me providing feedback on your code

You can do #1 without #2. I “code by myself” all the time, and I improve my skills greatly using that method.

I don’t have a teacher – I teach myself. I experiment, I observe, and I learn from what I observed.

This is how you do it.

Again, feedback is a separate, personalized service, which I am happy to provide to those who actually signed up for that service.

If I wanted feedback on my code from someone more experienced than myself, I would do the same thing – hire a teacher specifically for that task.

But (for example), I wouldn’t buy Pattern Recognition & Machine Learning by Christopher Bishop (currently $100 hardcover) and then ask Mr. Bishop to help me with my implementations.

Why? A book is a product, not a service (much like a video course).


More questions: Are you taking new clients for your consulting service?

Currently, no.



I love teaching and I love helping people!

But unfortunately, I cannot allow people to take advantage of my generosity when it’s unfair to others or unfair to myself.

I have to be fair in consideration of all my students and clients simultaneously.

Go to comments

Coding Interview Questions – Bioinformatics – Finding a motif in DNA (Episode 18)

May 4, 2021

Hello all!

In this video, we’re going to do another coding interview question where I walk you through the process of:

– explaining the problem

– solving the problem

– analyzing the time complexity of the solution

These videos are designed to help you practice for future job interviews where you are required to answer technical whiteboard problems.

The problem we’ll solve can be viewed as a bioinformatics problem, but you don’t have to know anything about biology! Even those who only know basic coding should be able to solve this problem.

I hope this motivates you to realize that “bioinformatics” is not all that different from what you already know how to do – and therefore, contributing to this important and exciting field is totally within your reach.

Click here to watch –>

Go to comments

Reinforcement Learning Algorithms: Expected SARSA

April 6, 2021

In this post, we’ll extend our toolset for Reinforcement Learning by considering a new temporal difference (TD) method called Expected SARSA.

In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about SARSA and Q-Learning, two popular TD methods. We’ll see how Expected SARSA unifies the two.

Review of SARSA and Q-Learning

Let’s begin by reviewing the regular TD methods covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state \( S_t \) and decides what action \( A_t \) to perform based on the state. This is called the policy and can be characterized by a probability distribution, \( \pi( A_t | S_t) \).

As the agent does this action, it changes the environment which results in the next state \( S_{t+1} \). A reward signal \( R_{t+1} \) is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, \( G_t \). The discounted return is defined as:

$$ G_t \dot{=} R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + … + \gamma^{T – t – 1} R_T $$

One important relationship we can infer from the definition of the return is:

$$ G_t = R_{t + 1} + \gamma G_{t+1} $$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

Taking the expected value of both sides and conditioning on the state \( S_t \), we arrive at the Bellman Equation:

$$ V_\pi(s) = E_\pi[ R_{t + 1} + \gamma V_\pi(S_{t+1}) | S_t = s] $$

If we condition on both the state \( S_t \) and the action \( A_t \), we get the Bellman Equation for the action-value:

$$ Q_\pi(s, a) = E_\pi \left[ R_{t + 1} + \gamma \sum_{a’} \pi(a’ |S_{t+1}) Q_\pi(S_{t+1}, a’) | S_t = s, A_t = a \right] $$

This will be the important relationship to consider when we learn about Expected SARSA.

Let’s go back a few steps.

The expected return given that the agent is in state \( S_t \) and performs action \( A_t \) at time \( t \) is given by the Q-table. Specifically:

$$ Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a] $$

The Q-table can be used to determine what the best action will be since we can just choose whichever action \( a \) maximizes \( Q(s,a) \).

The problem is, we do not know \( Q(s,a) \)! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution \( p(s’, r | s, a) \).

Generally speaking, this is unknown. e.g. Imagine building a self-driving car.

The full Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$ Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a) $$

As you recall, it’s possible to convert the formula for the sample mean into a recursive update – something that looks a bit like gradient descent.

This is convenient so that we can update \(Q\) each time we receive a new sample without having to sum up all \(N\) values each time. If we did that, our computations would get longer and longer as \(N\) grows!

The recursive update looks like this:

$$ Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ G^{(N)}(s,a) – Q_\pi^{(N-1)}(s, a) \right] $$

This shows us how to get the \(N\)th estimate from the \(N-1\)th estimate. By setting \( \alpha = \frac{1}{N} \), we get exactly the sample mean (this was derived in the course).

The Temporal Difference approach uses “bootstrapping”, where we use existing values in \(Q_\pi(s, a)\) to estimate the expected return.

The SARSA update looks as follows:

$$ Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ r + \gamma Q_\pi^{(N-1)}(s’, a’) – Q_\pi^{(N-1)}(s, a) \right] $$

Essentially, we have replaced the old “target”:

$$ G(s,a) $$

with a new “target”:

$$ r + \gamma Q(s’, a’) $$

This should remind you of the right-hand side of the Bellman Equation for Q, with all the expected values removed.

What is the significance of this?

With Monte Carlo, we would have to wait until the entire episode is complete in order to compute the return (since the return is the sum of all future rewards).

For very long episodes, this means that it will take a long time before any updates are made to your agent.

For infinitely long episodes, it’s not possible to compute the return at all! (You’d have to wait an infinite amount of time to get an answer.)

TD learning uses the immediate reward \( r \) and “estimates the rest” using the existing value of \( Q \).

Therefore, your agent learns on each step, rather than waiting until the end of an episode.

The Q-Learning “target” is:

$$ r + \gamma \max_a’ Q(s’, a’) $$

The main difference between SARSA and Q-Learning is that SARSA is on-policy while Q-Learning is off-policy.

This is because Q-Learning always uses the max, even when that action wasn’t taken. SARSA uses action \( a’ \) – whichever action was actually chosen for the next step. Q-Learning learns the value of the policy in which you take the max (even though your behavior policy may be different – e.g. epsilon-greedy). SARSA learns the value of the behavior policy.

Expected SARSA

For Expected SARSA, the target is:

$$ r + \gamma \sum_{a’} \pi( a’ | s’) Q(s’, a’) $$

This can also be written as:

$$ r + \gamma E_\pi [Q(s’, a’) | s’] $$

Thus, Expected SARSA uses the expected action-value for the next state \( s’ \), over all actions \( a’ \), weighted according to the policy distribution.

Like regular SARSA, this should remind you of the Bellman Equation for Q (even more so than regular SARSA since it now properly sums over the policy distribution).

You can think of regular SARSA as merely drawing samples from the Expected SARSA target distribution. Put a different way, SARSA does what Expected SARSA does, in expectation.

So how does Expected SARSA unify SARSA and Q-Learning?

In fact, Q-Learning is merely a special case of Expected SARSA.

First, recognize that Expected SARSA can be on-policy or off-policy. In the off-policy case, one could act according to some behavior policy \( b(a | s) \), while the target is computed according to the target policy \( \pi( a | s) \).

If \( b(a | s) \) is epsilon-greedy while \( \pi(a | s) \) is greedy, then Expected SARSA is equivalent to Q-Learning.

On the other hand, Expected SARSA generalizes SARSA as well, as discussed above.

Practical Considerations

1) Aside from all the theory above, the update itself is actually quite simple to implement. In some cases, you may find that it works better than SARSA or Q-Learning. Thus, it is a worthy addition to your RL toolbox and could form a part of your strategy for building agents to solve RL problems.

2) It’s slower than SARSA and Q-Learning since it requires summing over the action space on each update. So, you may have to decide whether the additional computation requirements are worth whatever increase in performance you observe.

Go to comments

Monte Carlo with Importance Sampling for Reinforcement Learning

March 7, 2021

In this post, we’ll extend our toolset for Reinforcement Learning by considering the Monte Carlo method with importance sampling.

In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about the Monte Carlo method. But that’s just the beginning. There is still more that can be done to improve the agent’s learning capabilities.

Review of Monte Carlo for Reinforcement Learning

Let’s begin by reviewing the regular Monte Carlo method covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state \( S_t \) and decides what action \( A_t \) to perform based on the state. This is called the policy and can be characterized by a probability distribution, \( \pi( A_t | S_t) \).

As the agent does this action, it changes the environment which results in the next state \( S_{t+1} \). A reward signal \( R_{t+1} \) is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, \( G_t \). The discounted return is defined as:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + … + \gamma^{T – t – 1} R_T $$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

The expected return given that the agent is in state \( S_t \) and performs action \( A_t \) at time \( t \) is given by the Q-table. Specifically:

$$ Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a] $$

The Q-table can be used to determine what the best action will be since we can just choose whichever action \( a \) maximizes \( Q(s,a) \).

The problem is, we do not know \( Q(s,a) \)! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution \( p(s’, r | s, a) \).

Generally speaking, this is unknown. e.g. Imagine building a self-driving car.

The Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$ Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a) $$

Where \( G^{(i)}(s,a) \) was the sample return when the agent was in state \( s \) and performed action \( a \) during the \( i \)’th episode.

Put simply: play a bunch of episodes, collect all the state-action-reward sequences, calculate the returns (by summing up the rewards), and then compute the average return for each state-action pair.


How can we ensure that we visit every state-action pair so that the whole Q-table is filled up with a sufficient number of samples?

Practically, we usually employ some kind of exploration strategy, such as epsilon-greedy. With epsilon-greedy, we perform the optimal action \( 1-\varepsilon \) of the time, and we pick a random action \( \varepsilon \) of the time. So \( \varepsilon \) is the probability of exploration.

The problem with this approach is that it leads to a suboptimal policy. Why? It means that approximately \( \varepsilon \) of the time, you are going to do something suboptimal!

(Note: it’s not exactly \( \varepsilon \) since choosing an action randomly can still lead to choosing the optimal action by chance.)

This is where we transition to the main topic of this article, which is how Importance Sampling can help us overcome this problem.

Monte Carlo with epsilon-greedy exploration is called an on-policy control method, because the action-value (Q-table) being estimated corresponds to the policy that the agent is following.

On the other hand, off-policy methods allow the agent to act according to one policy (called the behavior policy), while the action-value is computed for a different, eventually optimal policy (called the target policy).

Henceforth we will denote the target policy as \( \pi( a | s) \) and the behavior policy as \( b(a | s) \).

Importance Sampling

Suppose that we would like to estimate the expected value of some function \( s \) of a random variable \( X \) under some distribution \( \pi \). We write this as:

$$ E_\pi[ s(X) ] $$

If we know \( \pi \), then (assuming \( X \) is a discrete random variable) this can be computed as:

$$ E_\pi[ s(X) ] = \sum_x \pi(x)s(x) $$

If we don’t know \( \pi \), but \( X \) is sampled according to \( \pi \), then this expectation can be estimated by using the sample mean:

$$ E_\pi[ s(X) ] \approx \frac{1}{N}\sum_{i=1}^N s(X_i) $$

Now suppose that something prevents us from gathering samples according to the distribution \( \pi \), but it’s still possible to gather samples from a different distribution \( b \). i.e. We want \( X \sim \pi \) but we have \( X \sim b \) instead.

In this case, the above sample average does not give us the desired expectation. We would be estimating \( E_b[ s(X)] \) instead of \( E_\pi[ s(X) ] \).

The importance sampling solution is found by recognizing the following equalities:

&& E_\pi \left[ s(X) \right] \\
&=& \sum_x \pi(x)s(x) \\
&=& \sum_x \pi(x)s(x)\frac{b(x)}{b(x)} \\
&=& \sum_x b(x)s(x)\frac{\pi(x)}{b(x)} \\
&=& E_b \left[ s(X)\frac{\pi(X)}{b(X)} \right]

This tells us that it’s possible to estimate the expectation under \( \pi \) even when the samples are drawn according to \( b \). All we have to do is multiply by the importance sampling ratio, \( \frac{\pi(X)}{b(X)} \).

The only requirement is that \( b(x) \) is not 0 when \( \pi(x) \) is not 0. This is called “coverage”.

Applying Importance Sampling to Reinforcement Learning

In reinforcement learning, the return \( G_t \) is generated by acting according to the behavior policy \( b(a | s) \) with transition dynamics \( p(s’, r | s, a) \). But we would like to know the expectation of \( G_t \) under the target policy \( \pi(a | s) \) with the same transition dynamics.

In this case, the importance sampling ratio is a bit more complicated but can still be derived. \( G_t \) is a sample from the distribution of \( p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b \mspace{5mu} \forall \tau) \).

Basically this says: “the distribution of all the actions and states that happened after arriving in state \( S_t \), following the policy \( b \)”.

The distribution we want the expectation with respect to is the same thing, but with actions drawn according to \( \pi \). i.e. \( p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi \mspace{5mu} \forall \tau) \).

Thanks to the Markov property these distributions are easy to expand.

$$ p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b) = \prod_{\tau=t}^{T-1} b(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

$$ p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi) = \prod_{\tau=t}^{T-1} \pi(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

The importance sampling ratio is then just:

$$ \frac{p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi)}{p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b)} = \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

The transition dynamics cancel out because they are the same on both top and bottom.

This is convenient, because we know \( \pi \) and we know \( b \), but we do not know \( p \) (which is why we have to use Monte Carlo in the first place).

Let’s define this importance sampling ratio using the symbol \( \rho \).

$$ \rho_{t:T-1} \dot{=} \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

Using this importance sampling ratio, we can estimate \( Q_\pi(s,a) \) even though we are acting according to a different policy \( b \) and using the returns generated from this other policy.

$$ Q_\pi(s,a) \approx \frac{1}{N}\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a) $$

Where \( G^{(i)}(s,a) \) was the sample return when the agent was in state \( s \) and performed action \( a \) during the \( i \)’th episode, and \( \rho^{(i)}(s,a) \) was the corresponding importance sampling ratio.

Importance Sampling for Monte Carlo Implementation

At this point, you know all the theory. All that you have to do now is plug in the above importance sampling ratio in the appropriate places in your existing Monte Carlo code, and you’ll be doing Monte Carlo with importance sampling.

Here are some important considerations.

Like the return \( G_t \), the importance sampling ratio is defined in terms of future values. i.e. the importance sampling ratio at time \( t \) depends on the probabilities of the behavior and target policies at time \( t+1, t+2, … \).

Therefore, it would make sense to loop backwards in time to compute this ratio, just like we loop backwards in time to compute the return.

Just like the return, the importance sampling ratio can be computed recursively.

Finally, you’ll recall that for the regular unweighted sample mean, it’s possible to perform constant-time updates every time we collect a new sample, instead of naively summing up all the samples and dividing by N.

$$ Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) – \frac{1}{i}(G^{(i)}(s,a) – Q^{(i-1)}(s,a)) $$

Similarly, it’s possible to express the weighted sample mean update using a similar constant-time operation. i.e.

$$ Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) – \alpha^{(i)}(G^{(i)}(s,a) – Q^{(i-1)}(s,a)) $$

As an exercise, try to derive what \( \alpha^{(i)} \) should be.

Last point: I haven’t discussed weighted importance sampling, which can be used to reduce the variance of the estimate. The weighted importance sampling estimate looks like this:

$$ Q_\pi(s,a) \approx \frac{\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a)}{ \sum_{i=1}^N \rho^{(i)}(s,a) } $$


Let’s review why this is different from regular Monte Carlo.

Regular Monte Carlo (what I covered in my Reinforcement Learning course) is an on-policy control method.

We use epsilon-greedy because exploration is required to collect enough samples for all state-action pairs. Epsilon-greedy is suboptimal by definition. Our Q-table and final policy will thus be suboptimal.

What might be better is an off-policy control method, where we act according to a behavior policy which allows for exploration, but compute the Q-table according to the target greedy policy (the optimal policy).

#reinforcement learning

Go to comments

Deep Learning and Artificial Intelligence Newsletter

Get discount coupons, free machine learning material, and new course announcements