# Convert a Time Series Into an Image with Gramian Angular Fields and Markov Transition Fields

August 30, 2021

In my latest course (Time Series Analysis), I made subtle hints in the section on Convolutional Neural Networks that instead of using 1-D convolutions on 1-D time series, it is possible to convert a time series into an image and use 2-D convolutions instead.

CNNs with 2-D convolutions are the “typical” kind of neural network used in deep learning, which normally are used on images (e.g. ImageNet, object detection, segmentation, medical imaging and diagnosis, etc.)

In this article, we will look at 2 ways to convert a time series into an image:

1. Gramian Angular Field
2. Markov Transition Field

## Gramian Angular Field

The Gramian Angular Field is quite involved mathematically, so this article will discuss the intuition only, along with the code.

Those interesting in all the gory details are encouraged to read the paper, titled “Encoding Time Series as Images for Visual Inspection and Classification Using Tiled Convolutional Neural Networks” by Zhiguang Wang and Tim Oates.

We’ll build the intuition in a series of steps.

Let us begin by recalling that the dot product or inner product is a measure of similarity between two vectors.

$$\langle a, b\rangle = \lVert a \rVert \lVert b \rVert \cos \theta$$

Where $$\theta$$ is the angle between $$a$$ and $$b$$.

Ignoring the magnitude of the vectors, if the angle between them is small (i.e. close to 0) then the cosine of that angle will be nearly 1. If the angle is perpendicular, the cosine of the angle is 0. If the two vectors are pointing in opposite directions, then the cosine of the angle will be -1.

The Gram Matrix is just the repeated application of the inner product between every vector in a set of vectors, and every other vector in that same set of vectors.

i.e. Suppose that we store a set of column vectors in a matrix called $$X$$.

The Gram Matrix is:

$$G = X^TX$$

This expands to:

$$G = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & … & \langle x_1, x_N \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & … & \langle x_2, x_N \rangle \\ … & … & … & … \\ \langle x_N, x_1 \rangle & \langle x_N, x_2 \rangle & … & \langle x_N, x_N \rangle \end{bmatrix}$$

In other words, if we think of the inner product as the similarity between two vectors, then the Gram Matrix just gives us the pairwise similarity between every vector and every other vector.

Note that the Gramian Angular Field (GAF) does not apply the Gram Matrix directly (in fact, each value of the time series is a scalar, not a vector).

The first step in computing the GAF is to normalize the time series to be in the range [-1, +1].

Let’s assume we are given a time series $$X = \{x_1, x_2, …, x_N \}$$.

The normalized values are denoted by $$\tilde{x_i}$$.

The second step is to convert each value in the normalized time series into polar coordinates.

We use the following transformation:

$$\phi_i = \arccos \tilde{x_i}$$

$$r_i = \frac{t_i}{N}$$

Where $$t_i \in \mathbb{N}$$ represents the timestamp of data point $$x _i$$.

Finally, the GAF method defines its own “special” inner product as:

$$\langle x_1, x_2 \rangle = \cos(\phi_1 + \phi_2)$$

From here, the above formula for $$G$$ still applies (except using $$\tilde{X}$$ instead of $$X$$, and using the custom inner product instead of the usual version).

Here is an illustration of the process:

So why use the GAF?

Like the original Gram Matrix, it gives you a “picture” (no pun intended) of the relationship between every point and every other point in the time series.

That is, it displays the temporal correlation structure in the time series.

Here’s how you can use it in code.

Firstly, you need to install the pyts library. Then, run the following code on a time series of your choice:

Note that the library allows you to rescale the image with the image_size argument.

As an exercise, try using this method instead of the 1-D CNNs we used in the course and compare their performance!

## Markov Transition Field

The Markov Transition Field (MTF) is another method of converting a time series into an image.

The process is a bit simpler than that of the GAF.

If you have taken any of my courses which involve Markov Models (like Natural Language Processing, or HMMs) you should feel right at home.

Let’s assume we have an N-length time series.

We begin by putting each value in the time series into quantiles (i.e. we “bin” each value).

For example, if we use quartiles (4 bins), the smallest 25% of values would define the boundaries of the first quartile, the second smallest 25% of values would define the boundaries of the second quartile, etc.

We can think of each bin as a ‘state’ (using Markov model terminology).

Intuitively, we know that what we’d like to do when using Markov models is to form the state transition matrix.

This matrix has the values:

$$A_{ij} = P(s_t = j | s_{t-1} = i)$$

That is, $$A_{ij}$$ is the probability of transitioning from state i to state j.

As usual, we estimate this value by maximum likelihood. ( $$A_{ij}$$ is the count of transitions from i to j, divided by the total number of times we were in state i).

Note that if we have $$Q$$ quantiles (i.e. we have $$Q$$ “states”), then $$A$$ is a $$Q \times Q$$ matrix.

The MTF follows a similar concept.

The MTF (denoted by $$M$$) is an $$N \times N$$ matrix where:

$$M_{kl} = A_{q_k q_l}$$

And where $$q_k$$ is the quantile (“bin”) for $$x_k$$, and $$q_l$$ is the quantile for $$x_l$$.

Note: I haven’t re-used the letters i and j to index $$M$$, which most resources do and it’s super confusing.

Do not mix up the indices for $$M$$ and $$A$$! The indices in $$A$$ refer to states. The indices for $$M$$ are temporal.

$$A_{ij}$$ is the probability of transitioning from state i to state j.

$$M_{kl}$$ is the probability of a one-step transition from the bin for $$x_k$$, to the bin for $$x_l$$.

That is, it looks at $$x_k$$ and $$x_l$$, which are 2 points in the time series at arbitrary time steps $$k$$ and $$l$$.

$$q_k$$ and $$q_l$$ are the corresponding quantiles.

$$M_{kl}$$ is then just the probability that we saw a direct one-step (i.e. Markovian) transition from $$q_k$$ to $$q_l$$ in the time series.

So why use the MTF?

It shows us how related 2 arbitrary points in the time series are, relative to how often they appear next to each other in the time series.

Here’s how you can use it in code.

Note that the library allows you to rescale the image with the image_size argument.

As an exercise, try using this method instead of the 1-D CNNs we used in the course and compare their performance

Enjoy!

# Should you study the theory behind machine learning?

August 23, 2021

In this post, I want to discuss why you should not study the theory behind machine learning.

This may surprise some of you, since my courses can appear to be more “theoretical” than other ML courses on popular websites such as Udemy.

However, that is not the kind of “theory” I am talking about.

Most popular courses in ML don’t look at any math at all.

They are popular precisely for this reason: lack of math makes them accessible to the average Joe.

This does a disservice to you students, because you end up not having any solid understanding about how the algorithm works.

You may end up:

• doing things that don’t make sense, due to that lack of understanding.
• only being able to copy code from others, but not write any code yourself.
• not knowing how to apply algorithms to new kinds of data, without someone showing you how first.

For more discussion on that, see my post: “Why do you need math for machine learning and deep learning?

But let’s make this clear: math != theory.

When we look at math in my courses, we only look at the math needed to derive the algorithm and understand how it works at an intuitive level.

Yes, believe it or not, we are using math to improve our intuition.

This is despite what many beginners might think. When they see math, they automatically assume “math” = “not intuitive”, and that “intuitive” = “pictures, animations, and purposely avoiding math”.

That’s OK if you want to read a news article in the NY Times about ML, but not when you want to be a practitioner of ML.

Those are 2 different levels of “intuition” (layman vs. practitioner).

To see an extreme example of this, one need not look any further than Albert Einstein. Einstein was great at communicating his ideas to the public. Everyone can easily understand the layman interpretation of general relativity (mass bends space and time). But this is not the same as being a practitioner of relativistic physics.

Everyone has seen this picture and understands what it means at a high level. But does that mean you are a physicist or that you can “do physics”?

Anyway, that was just an aside so we don’t confuse “math used for intuition” and “layman intuition” and “theory”. These are 3 separate things. Just because you’re looking at some math, does not automatically imply you’re looking at “theory”.

What do we mean by “theory”?

Here’s a simple question to consider. Why does gradient descent work?

Despite the fact that we have used gradient descent in many of my courses, and derived the gradient descent update rules for neural networks, SVMs, and other models, we have never discussed why it works.

And that’s OK!

The “mathematical intuition” is enough.

But let’s get back to the question of this article: Why is the Lazy Programmer saying we should not study theory?

Well, this is the kind of “theory” that gets so deep, it:

• Does not produce any near-term gains in your work
• Requires a very high level of math ability (e.g. real analysis, optimization, dynamical systems)
• Is on the cutting-edge of understanding, and thus very difficult, likely to be disputed or even superseded in the near future

Case in point: although we have been using gradient descent for years in my courses (and decades before that in general), our understanding is still not yet complete.

Here’s an article that just came out this year on gradient descent (August 2021): “Computer Scientists Discover Limits of Major Research Algorithm“.

Here’s a direct link to the corresponding paper, called “The Complexity of Gradient Descent: CLS = PPAD ∩ PLS”: https://arxiv.org/abs/2011.01929

There will be more papers on these “theory” topics in the years to come.

My advice is not to go down this path, unless you really enjoy it, you are doing graduate research (e.g. PhD-level), you don’t mind if ideas you spent years and years working on might be proven incorrect, and you have a very high level of math ability in subjects like real analysis, optimization, and dynamical systems.

# Predicting Stock Prices with Facebook Prophet

August 3, 2021

Prophet is Facebook’s library for time series forecasting. It is mainly geared towards business datasets (e.g. predicting adspend or CPU usage), but a natural question that comes up with my students whenever we talk about time series is: “can it predict stock prices?”

In this article, I will discuss how to use FB Prophet to predict stock prices, and I’ll also show you what not to do (things I’ve seen in other popular blogs). Furthermore, we will benchmark the Prophet model with the naive forecast, to check whether or not one would really want to use this.

Note: This is an excerpt from my full VIP course, “Time Series Analysis, Forecasting, and Machine Learning“. If you want the code for this example, along with many, many other code examples on stock prices, sales data, and smartphone data, get the course!

The Prophet section will be part of the VIP version only, so get it now while the VIP coupon is still active!

## How does Prophet work?

The Prophet model is a 3 component, non-autoregressive time series model. Specifically:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon(t)$$

The Prophet model is not autoregressive, like ARIMA, exponential smoothing, and the other methods we study in a typical time series course (including my own).

The 3 components are:

1) The trend $$g(t)$$ which can be either linear or logistic.

2) The seasonality $$s(t)$$, modeled using a Fourier series.

3) The holiday component $$h(t)$$, which is essentially a one-hot vector “dotted” with a vector of weights, each representing the contribution from their respective holiday.

## How to use Prophet for predicting stock prices

In my course, we do 3 experiments. Our data is Google’s stock price from approximately 2013-2018, but we only use the first 2 years as training data.

The first experiment is “plug-and-play” into Prophet with the default settings.

Here are the results:

Unfortunately, Prophet mistakenly believes there is a weekly seasonal component, which is the reason for the little “hairs” in the forecast.

When we plot the components of the model, we see that Prophet has somehow managed to find some weekly seasonality.

Of course, this is completely wrong! The model believes that the stock price increases on the weekends, which is highly unlikely because we don’t have any data for the weekend.

The second experiment is an example of what not to do. I saw this in every other popular blog, which is yet another “data point” that should convince you not to trust these popular data science blogs you find online (except for mine, obviously).

In this experiment, we set daily_seasonality to True in the model constructor.

Here are the results.

It seems like those weird little “hairs” coming from the weekly seasonal component have disappeared.

“The Lazy Programmer is wrong!” you may proclaim.

However, this is because you may not understand what daily seasonality really means.

Let’s see what happens when we plot the components.

This plot should make you very suspicious. Pay attention to the final chart.

“Daily seasonality” pertains to a pattern that repeats everyday with sub-daily changes.

This cannot be the case, because our data only has daily granularity!

Lesson: don’t listen to those “popular” blogs.

For experiment 3, we set weekly seasonality to False. Alternatively, you could try playing around with the priors.

Here are the results.

Notice that the “little hairs” are again not present.

## Is this model actually good?

Just because you can make a nice chart, does not mean you have done anything useful.

In fact, you see the exact same mistakes in those blog articles and terrible Udemy courses promising to “predict stock prices with LSTMs” (which I will call out every chance I get).

One of the major mistakes I see in nearly every blog post about predicting stock prices is that they don’t bother to compare it to a benchmark. And as you’ll see, the benchmark for stock prices is quite a low bar – there is no reason not to compare.

Your model is only useful if it can beat the benchmark.

For stock price predictions, the benchmark is typically the naive forecast, which is the optimal forecast for a random walk.

Random walks are often used as a model for stock prices since they share some common attributes.

For those unfamiliar, the naive forecast is simply where you predict the last-known value.

Example: If today’s price on July 5 is $200 and I want to make a forecast with a 5-day horizon, then I will predict$200 for July 6, $200 for July 7, …, and$200 for July 10.

I won’t bore you with the code (although it’s included in the course if you’re interested), but the answer is: Prophet does not beat the naive forecast.

In fact, it does not beat the naive forecast on any horizon I tried (5 days, 30 days, 60 days).

Sidenote: it’d be a good exercise to try 1 day as well.

Are stock prices really random walks? Although this particular example provides evidence supporting the random walk hypothesis, in my course, the GARCH section will provide strong evidence against it! Again, it’s all explained in my latest course, “Time Series Analysis, Forecasting, and Machine Learning“. Only the VIP version will contain the sections on Prophet, GARCH, and other important tools.

The VIP version is intended to be limited-time only, and the current coupon expires in less than one month!

Get your copy today while you still can.

# Why do you need math for machine learning and deep learning?

July 9, 2021

In this article, I will demonstrate why math is necessary for machine learning, data science, deep learning, and AI.

Most of my students have already heard this from me countless times. College-level math is a prerequisite for nearly all of my courses already.

Perhaps you may believe I am biased, because I’m the one teaching these courses which require all this math.

It would seem that I am just some crazy guy, making things extra hard for you because I like making things difficult.

WRONG.

You’ve heard it from me many times. Now you’ll hear it from others.

## Example #1

Let’s begin with one of the most famous professors in ML, Daphne Koller, who co-founded Coursera.

In this clip, Lex Fridman asks what advice she would have for those interested in beginning a journey into AI and machine learning.

One important thing she mentions, which I have seen time and time again in my own experience, is that those without typical prerequisite math backgrounds often make mistakes and do things that don’t make sense.

She’s being nice here, but I’ve met many of these folks who not only have no idea that what they are doing does not make sense, they also tend to be overly confident about it!

Then it becomes a burden for me, because I have to put in more effort explaining the basics to you just to convince you that you are wrong.

For that reason, I generally advise against hiring people for ML roles if they do not know basic math.

## Example #2

I enjoyed this strongly worded Reddit comment.

Original post:

Top comment:

## Example #3

Not exactly machine learning, but very related field: quant finance.

In fact, many students taking my courses dream about applying ML to finance.

Well, it’s going to be pretty hard if you can’t pass these interview questions.

http://www.math.kent.edu/~oana/math60070/InterviewProblems.pdf

Think about this logically: All quants who have a job can pass these kinds of interview questions. But you cannot. How well do you think you will do compared to them?

## Example #4

Entrepreneur and angel investor Naval Ravikant explains why deriving (what we do in all of my in-depth machine learning courses) is much more important than memorizing on the Joe Rogan Experience.

Most beginner-level Udemy courses don’t derive anything – they just tell you random facts about ML algorithms and then jump straight to the usual 3 lines of scikit-learn code. Useless!

# Time Series: How to convert AR(p) to VAR(1) and VAR(p) to VAR(1)

July 1, 2021

This is a very condensed post, mainly just so I could write down the equations I need for my Time Series Analysis course. 😉

However, it you find it useful – I am happy to hear that!

[Get 75% off the VIP version here]

$$y_t = b + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t$$

Suppose we create a vector containing both $$y_t$$ and $$y_{t -1}$$:

$$\begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix}$$

We can write our AR(2) as follows:

$$\begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \end{bmatrix}$$

Exercise: expand the above to see that you get back the original AR(2). Note that the 2nd line just ends up giving you $$y_{t-1} = y_{t-1}$$.

The above is just a VAR(1)!

You can see this by letting:

$$\textbf{z}_t = \begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix}$$

$$\textbf{b}’ = \begin{bmatrix} b \\ 0 \end{bmatrix}$$

$$\boldsymbol{\Phi}’_1 = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix}$$

$$\boldsymbol{\eta}_t = \begin{bmatrix} \varepsilon_t \\ 0 \end{bmatrix}$$.

Then we get:

$$\textbf{z}_t = \textbf{b}’ + \boldsymbol{\Phi}’_1\textbf{z}_{t-1} + \boldsymbol{\eta}_t$$

Which is a VAR(1).

Now let us try to do the same thing with an AR(3).

$$y_t = b + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \varepsilon_t$$

We can write our AR(3) as follows:

$$\begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \end{bmatrix} = \begin{bmatrix} b \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \\ y_{t-3} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \\ 0 \end{bmatrix}$$

Note that this is also a VAR(1).

Of course, we can just repeat the same pattern for AR(p).

The cool thing is, we can extend this to VAR(p) as well, to show that any VAR(p) can be expressed as a VAR(1).

Suppose we have a VAR(3).

$$\textbf{y}_t = \textbf{b} + \boldsymbol{\Phi}_1 \textbf{y}_{t-1} + \boldsymbol{\Phi}_2 \textbf{y}_{t-2} + \boldsymbol{\Phi}_3 \textbf{y}_{t-3} + \boldsymbol{ \varepsilon }_t$$

Now suppose that we create a new vector by concatenating $$\textbf{y}_t$$, $$\textbf{y}_{t-1}$$, and $$\textbf{y}_{t-2}$$. We get:

$$\begin{bmatrix} \textbf{y}_t \\ \textbf{y}_{t-1} \\ \textbf{y}_{t-2} \end{bmatrix} = \begin{bmatrix} \textbf{b} \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \boldsymbol{\Phi}_1 & \boldsymbol{\Phi}_2 & \boldsymbol{\Phi}_3 \\ I & 0 & 0 \\ 0 & I & 0 \end{bmatrix} \begin{bmatrix} \textbf{y}_{t-1} \\ \textbf{y}_{t-2} \\ \textbf{y}_{t-3} \end{bmatrix} + \begin{bmatrix} \boldsymbol{\varepsilon_t} \\ 0 \\ 0 \end{bmatrix}$$

This is a VAR(1)!

# Coding Interview Questions – Bioinformatics Rosalind.info – Finding a motif in DNA (Episode 18)

May 4, 2021

Hello all!

In this video, we’re going to do another coding interview question where I walk you through the process of:

– explaining the problem

– solving the problem

– analyzing the time complexity of the solution

These videos are designed to help you practice for future job interviews where you are required to answer technical whiteboard problems.

The problem we’ll solve can be viewed as a bioinformatics problem, but you don’t have to know anything about biology! Even those who only know basic coding should be able to solve this problem.

I hope this motivates you to realize that “bioinformatics” is not all that different from what you already know how to do – and therefore, contributing to this important and exciting field is totally within your reach.

# Reinforcement Learning Algorithms: Expected SARSA

April 6, 2021

In this post, we’ll extend our toolset for Reinforcement Learning by considering a new temporal difference (TD) method called Expected SARSA.

In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about SARSA and Q-Learning, two popular TD methods. We’ll see how Expected SARSA unifies the two.

## Review of SARSA and Q-Learning

Let’s begin by reviewing the regular TD methods covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state $$S_t$$ and decides what action $$A_t$$ to perform based on the state. This is called the policy and can be characterized by a probability distribution, $$\pi( A_t | S_t)$$.

As the agent does this action, it changes the environment which results in the next state $$S_{t+1}$$. A reward signal $$R_{t+1}$$ is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, $$G_t$$. The discounted return is defined as:

$$G_t \dot{=} R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + … + \gamma^{T – t – 1} R_T$$

One important relationship we can infer from the definition of the return is:

$$G_t = R_{t + 1} + \gamma G_{t+1}$$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

Taking the expected value of both sides and conditioning on the state $$S_t$$, we arrive at the Bellman Equation:

$$V_\pi(s) = E_\pi[ R_{t + 1} + \gamma V_\pi(S_{t+1}) | S_t = s]$$

If we condition on both the state $$S_t$$ and the action $$A_t$$, we get the Bellman Equation for the action-value:

$$Q_\pi(s, a) = E_\pi \left[ R_{t + 1} + \gamma \sum_{a’} \pi(a’ |S_{t+1}) Q_\pi(S_{t+1}, a’) | S_t = s, A_t = a \right]$$

This will be the important relationship to consider when we learn about Expected SARSA.

Let’s go back a few steps.

The expected return given that the agent is in state $$S_t$$ and performs action $$A_t$$ at time $$t$$ is given by the Q-table. Specifically:

$$Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a]$$

The Q-table can be used to determine what the best action will be since we can just choose whichever action $$a$$ maximizes $$Q(s,a)$$.

The problem is, we do not know $$Q(s,a)$$! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution $$p(s’, r | s, a)$$.

Generally speaking, this is unknown. e.g. Imagine building a self-driving car.

The full Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a)$$

As you recall, it’s possible to convert the formula for the sample mean into a recursive update – something that looks a bit like gradient descent.

This is convenient so that we can update $$Q$$ each time we receive a new sample without having to sum up all $$N$$ values each time. If we did that, our computations would get longer and longer as $$N$$ grows!

The recursive update looks like this:

$$Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ G^{(N)}(s,a) – Q_\pi^{(N-1)}(s, a) \right]$$

This shows us how to get the $$N$$th estimate from the $$N-1$$th estimate. By setting $$\alpha = \frac{1}{N}$$, we get exactly the sample mean (this was derived in the course).

The Temporal Difference approach uses “bootstrapping”, where we use existing values in $$Q_\pi(s, a)$$ to estimate the expected return.

The SARSA update looks as follows:

$$Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ r + \gamma Q_\pi^{(N-1)}(s’, a’) – Q_\pi^{(N-1)}(s, a) \right]$$

Essentially, we have replaced the old “target”:

$$G(s,a)$$

with a new “target”:

$$r + \gamma Q(s’, a’)$$

This should remind you of the right-hand side of the Bellman Equation for Q, with all the expected values removed.

What is the significance of this?

With Monte Carlo, we would have to wait until the entire episode is complete in order to compute the return (since the return is the sum of all future rewards).

For very long episodes, this means that it will take a long time before any updates are made to your agent.

For infinitely long episodes, it’s not possible to compute the return at all! (You’d have to wait an infinite amount of time to get an answer.)

TD learning uses the immediate reward $$r$$ and “estimates the rest” using the existing value of $$Q$$.

Therefore, your agent learns on each step, rather than waiting until the end of an episode.

The Q-Learning “target” is:

$$r + \gamma \max_a’ Q(s’, a’)$$

The main difference between SARSA and Q-Learning is that SARSA is on-policy while Q-Learning is off-policy.

This is because Q-Learning always uses the max, even when that action wasn’t taken. SARSA uses action $$a’$$ – whichever action was actually chosen for the next step. Q-Learning learns the value of the policy in which you take the max (even though your behavior policy may be different – e.g. epsilon-greedy). SARSA learns the value of the behavior policy.

## Expected SARSA

For Expected SARSA, the target is:

$$r + \gamma \sum_{a’} \pi( a’ | s’) Q(s’, a’)$$

This can also be written as:

$$r + \gamma E_\pi [Q(s’, a’) | s’]$$

Thus, Expected SARSA uses the expected action-value for the next state $$s’$$, over all actions $$a’$$, weighted according to the policy distribution.

Like regular SARSA, this should remind you of the Bellman Equation for Q (even more so than regular SARSA since it now properly sums over the policy distribution).

You can think of regular SARSA as merely drawing samples from the Expected SARSA target distribution. Put a different way, SARSA does what Expected SARSA does, in expectation.

So how does Expected SARSA unify SARSA and Q-Learning?

In fact, Q-Learning is merely a special case of Expected SARSA.

First, recognize that Expected SARSA can be on-policy or off-policy. In the off-policy case, one could act according to some behavior policy $$b(a | s)$$, while the target is computed according to the target policy $$\pi( a | s)$$.

If $$b(a | s)$$ is epsilon-greedy while $$\pi(a | s)$$ is greedy, then Expected SARSA is equivalent to Q-Learning.

On the other hand, Expected SARSA generalizes SARSA as well, as discussed above.

## Practical Considerations

1) Aside from all the theory above, the update itself is actually quite simple to implement. In some cases, you may find that it works better than SARSA or Q-Learning. Thus, it is a worthy addition to your RL toolbox and could form a part of your strategy for building agents to solve RL problems.

2) It’s slower than SARSA and Q-Learning since it requires summing over the action space on each update. So, you may have to decide whether the additional computation requirements are worth whatever increase in performance you observe.

# Monte Carlo with Importance Sampling for Reinforcement Learning

March 7, 2021

In this post, we’ll extend our toolset for Reinforcement Learning by considering the Monte Carlo method with importance sampling.

In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about the Monte Carlo method. But that’s just the beginning. There is still more that can be done to improve the agent’s learning capabilities.

## Review of Monte Carlo for Reinforcement Learning

Let’s begin by reviewing the regular Monte Carlo method covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state $$S_t$$ and decides what action $$A_t$$ to perform based on the state. This is called the policy and can be characterized by a probability distribution, $$\pi( A_t | S_t)$$.

As the agent does this action, it changes the environment which results in the next state $$S_{t+1}$$. A reward signal $$R_{t+1}$$ is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, $$G_t$$. The discounted return is defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + … + \gamma^{T – t – 1} R_T$$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

The expected return given that the agent is in state $$S_t$$ and performs action $$A_t$$ at time $$t$$ is given by the Q-table. Specifically:

$$Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a]$$

The Q-table can be used to determine what the best action will be since we can just choose whichever action $$a$$ maximizes $$Q(s,a)$$.

The problem is, we do not know $$Q(s,a)$$! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution $$p(s’, r | s, a)$$.

Generally speaking, this is unknown. e.g. Imagine building a self-driving car.

The Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a)$$

Where $$G^{(i)}(s,a)$$ was the sample return when the agent was in state $$s$$ and performed action $$a$$ during the $$i$$’th episode.

Put simply: play a bunch of episodes, collect all the state-action-reward sequences, calculate the returns (by summing up the rewards), and then compute the average return for each state-action pair.

Easy!

How can we ensure that we visit every state-action pair so that the whole Q-table is filled up with a sufficient number of samples?

Practically, we usually employ some kind of exploration strategy, such as epsilon-greedy. With epsilon-greedy, we perform the optimal action $$1-\varepsilon$$ of the time, and we pick a random action $$\varepsilon$$ of the time. So $$\varepsilon$$ is the probability of exploration.

The problem with this approach is that it leads to a suboptimal policy. Why? It means that approximately $$\varepsilon$$ of the time, you are going to do something suboptimal!

(Note: it’s not exactly $$\varepsilon$$ since choosing an action randomly can still lead to choosing the optimal action by chance.)

This is where we transition to the main topic of this article, which is how Importance Sampling can help us overcome this problem.

Monte Carlo with epsilon-greedy exploration is called an on-policy control method, because the action-value (Q-table) being estimated corresponds to the policy that the agent is following.

On the other hand, off-policy methods allow the agent to act according to one policy (called the behavior policy), while the action-value is computed for a different, eventually optimal policy (called the target policy).

Henceforth we will denote the target policy as $$\pi( a | s)$$ and the behavior policy as $$b(a | s)$$.

## Importance Sampling

Suppose that we would like to estimate the expected value of some function $$s$$ of a random variable $$X$$ under some distribution $$\pi$$. We write this as:

$$E_\pi[ s(X) ]$$

If we know $$\pi$$, then (assuming $$X$$ is a discrete random variable) this can be computed as:

$$E_\pi[ s(X) ] = \sum_x \pi(x)s(x)$$

If we don’t know $$\pi$$, but $$X$$ is sampled according to $$\pi$$, then this expectation can be estimated by using the sample mean:

$$E_\pi[ s(X) ] \approx \frac{1}{N}\sum_{i=1}^N s(X_i)$$

Now suppose that something prevents us from gathering samples according to the distribution $$\pi$$, but it’s still possible to gather samples from a different distribution $$b$$. i.e. We want $$X \sim \pi$$ but we have $$X \sim b$$ instead.

In this case, the above sample average does not give us the desired expectation. We would be estimating $$E_b[ s(X)]$$ instead of $$E_\pi[ s(X) ]$$.

The importance sampling solution is found by recognizing the following equalities:

$$\begin{eqnarray} && E_\pi \left[ s(X) \right] \\ &=& \sum_x \pi(x)s(x) \\ &=& \sum_x \pi(x)s(x)\frac{b(x)}{b(x)} \\ &=& \sum_x b(x)s(x)\frac{\pi(x)}{b(x)} \\ &=& E_b \left[ s(X)\frac{\pi(X)}{b(X)} \right] \end{eqnarray}$$

This tells us that it’s possible to estimate the expectation under $$\pi$$ even when the samples are drawn according to $$b$$. All we have to do is multiply by the importance sampling ratio, $$\frac{\pi(X)}{b(X)}$$.

The only requirement is that $$b(x)$$ is not 0 when $$\pi(x)$$ is not 0. This is called “coverage”.

## Applying Importance Sampling to Reinforcement Learning

In reinforcement learning, the return $$G_t$$ is generated by acting according to the behavior policy $$b(a | s)$$ with transition dynamics $$p(s’, r | s, a)$$. But we would like to know the expectation of $$G_t$$ under the target policy $$\pi(a | s)$$ with the same transition dynamics.

In this case, the importance sampling ratio is a bit more complicated but can still be derived. $$G_t$$ is a sample from the distribution of $$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b \mspace{5mu} \forall \tau)$$.

Basically this says: “the distribution of all the actions and states that happened after arriving in state $$S_t$$, following the policy $$b$$”.

The distribution we want the expectation with respect to is the same thing, but with actions drawn according to $$\pi$$. i.e. $$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi \mspace{5mu} \forall \tau)$$.

Thanks to the Markov property these distributions are easy to expand.

$$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b) = \prod_{\tau=t}^{T-1} b(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

$$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi) = \prod_{\tau=t}^{T-1} \pi(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

The importance sampling ratio is then just:

$$\frac{p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi)}{p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b)} = \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

The transition dynamics cancel out because they are the same on both top and bottom.

This is convenient, because we know $$\pi$$ and we know $$b$$, but we do not know $$p$$ (which is why we have to use Monte Carlo in the first place).

Let’s define this importance sampling ratio using the symbol $$\rho$$.

$$\rho_{t:T-1} \dot{=} \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

Using this importance sampling ratio, we can estimate $$Q_\pi(s,a)$$ even though we are acting according to a different policy $$b$$ and using the returns generated from this other policy.

$$Q_\pi(s,a) \approx \frac{1}{N}\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a)$$

Where $$G^{(i)}(s,a)$$ was the sample return when the agent was in state $$s$$ and performed action $$a$$ during the $$i$$’th episode, and $$\rho^{(i)}(s,a)$$ was the corresponding importance sampling ratio.

## Importance Sampling for Monte Carlo Implementation

At this point, you know all the theory. All that you have to do now is plug in the above importance sampling ratio in the appropriate places in your existing Monte Carlo code, and you’ll be doing Monte Carlo with importance sampling.

Here are some important considerations.

Like the return $$G_t$$, the importance sampling ratio is defined in terms of future values. i.e. the importance sampling ratio at time $$t$$ depends on the probabilities of the behavior and target policies at time $$t+1, t+2, …$$.

Therefore, it would make sense to loop backwards in time to compute this ratio, just like we loop backwards in time to compute the return.

Just like the return, the importance sampling ratio can be computed recursively.

Finally, you’ll recall that for the regular unweighted sample mean, it’s possible to perform constant-time updates every time we collect a new sample, instead of naively summing up all the samples and dividing by N.

$$Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) – \frac{1}{i}(G^{(i)}(s,a) – Q^{(i-1)}(s,a))$$

Similarly, it’s possible to express the weighted sample mean update using a similar constant-time operation. i.e.

$$Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) – \alpha^{(i)}(G^{(i)}(s,a) – Q^{(i-1)}(s,a))$$

As an exercise, try to derive what $$\alpha^{(i)}$$ should be.

Last point: I haven’t discussed weighted importance sampling, which can be used to reduce the variance of the estimate. The weighted importance sampling estimate looks like this:

$$Q_\pi(s,a) \approx \frac{\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a)}{ \sum_{i=1}^N \rho^{(i)}(s,a) }$$

## Recap

Let’s review why this is different from regular Monte Carlo.

Regular Monte Carlo (what I covered in my Reinforcement Learning course) is an on-policy control method.

We use epsilon-greedy because exploration is required to collect enough samples for all state-action pairs. Epsilon-greedy is suboptimal by definition. Our Q-table and final policy will thus be suboptimal.

What might be better is an off-policy control method, where we act according to a behavior policy which allows for exploration, but compute the Q-table according to the target greedy policy (the optimal policy).

#reinforcement learning

# LeetCode in Machine Learning and Data Science (Episode 17)

January 15, 2021

Just the other day, I came across a forum post discussing why job interviews for ML engineers and data scientists often involve algorithms and coding. “I don’t want to be a software engineer!” you say.

Except, this is reality. And if you want a job, you have to face this reality.

In this free video, I will show you many of the insightful comments posted in the discussion, and we are going to complete an example LeetCode interview question from beginning to end. I’m also going to show you a typical answer given by someone who doesn’t know algorithms and why that answer is wrong.

October 26, 2020

Hello all!

Today, I’ve got 2 exciting things for you.

First, now is your chance to VOTE to tell me what you want to see in my next course. Transformers? Time Series Analysis? More advanced GANs? More advanced Reinforcement Learning? Let me know in this survey (it’s anonymous):

https://forms.gle/CNjjdaffSpi67qvq9

Second, check out the latest episode of The Lazy Programmer Show, where I discuss why bad programmers are always trying to get the latest version of some language or library, and why they tend to freak out when things are not in whatever version they happen to be using.

I look at this topic from 3 different perspectives, including:

1) What is it like in the “real world” working at a “real job”?

2) What kind of skills should a typical (competent) programmer have?

3) What are students learning and how do they approach machine learning and coding in colleges? Remember those students become “new grads” and those “new grads” become “junior engineers”.

What would your boss say if a junior engineer could run circles around you, a so-called “professional”?