# Predicting Stock Prices with Facebook Prophet

August 3, 2021

Prophet is Facebook’s library for time series forecasting. It is mainly geared towards business datasets (e.g. predicting adspend or CPU usage), but a natural question that comes up with my students whenever we talk about time series is: “can it predict stock prices?”

In this article, I will discuss how to use FB Prophet to predict stock prices, and I’ll also show you what not to do (things I’ve seen in other popular blogs). Furthermore, we will benchmark the Prophet model with the naive forecast, to check whether or not one would really want to use this.

Note: This is an excerpt from my full VIP course, “Time Series Analysis, Forecasting, and Machine Learning“. If you want the code for this example, along with many, many other code examples on stock prices, sales data, and smartphone data, get the course!

The Prophet section will be part of the VIP version only, so get it now while the VIP coupon is still active!

## How does Prophet work?

The Prophet model is a 3-component, non-autoregressive time series model. Specifically:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon(t)$$

Unlike ARIMA, exponential smoothing, and the other methods we study in a typical time series course (including my own), the Prophet model is not autoregressive.

The 3 components are:

1) The trend $$g(t)$$ which can be either linear or logistic.

2) The seasonality $$s(t)$$, modeled using a Fourier series.

3) The holiday component $$h(t)$$, which is essentially a one-hot vector “dotted” with a vector of weights, each representing the contribution from their respective holiday.

## How to use Prophet for predicting stock prices

In my course, we do 3 experiments. Our data is Google’s stock price from approximately 2013-2018, but we only use the latter 2 years as training data.

The first experiment is “plug-and-play” into Prophet with the default settings.

Here are the results:

Unfortunately, Prophet mistakenly believes there is a weekly seasonal component, which is the reason for the little “hairs” in the forecast.

When we plot the components of the model, we see that Prophet has somehow managed to find some weekly seasonality.

Of course, this is completely wrong! The model believes that the stock price increases on the weekends, even though we have no weekend data at all, so there is nothing that could justify such a pattern.

The second experiment is an example of what not to do. I saw this in every other popular blog, which is yet another “data point” that should convince you not to trust these popular data science blogs you find online (except for mine, obviously).

In this experiment, we set daily_seasonality to True in the model constructor.

Here are the results.

It seems like those weird little “hairs” coming from the weekly seasonal component have disappeared.

“The Lazy Programmer is wrong!” you may proclaim.

However, that objection rests on a misunderstanding of what daily seasonality really means.

Let’s see what happens when we plot the components.

This plot should make you very suspicious. Pay attention to the final chart.

“Daily seasonality” pertains to a pattern that repeats every day, with sub-daily changes.

This cannot be the case, because our data only has daily granularity!

Lesson: don’t listen to those “popular” blogs.

For experiment 3, we set weekly_seasonality to False in the model constructor. Alternatively, you could try playing around with the priors.

Here are the results.

Notice that the “little hairs” are again not present.

## Is this model actually good?

Just because you can make a nice chart, does not mean you have done anything useful.

In fact, you see the exact same mistakes in those blog articles and terrible Udemy courses promising to “predict stock prices with LSTMs” (which I will call out every chance I get).

One of the major mistakes I see in nearly every blog post about predicting stock prices is that they don’t bother to compare it to a benchmark. And as you’ll see, the benchmark for stock prices is quite a low bar – there is no reason not to compare.

Your model is only useful if it can beat the benchmark.

For stock price predictions, the benchmark is typically the naive forecast, which is the optimal forecast for a random walk.

Random walks are often used as a model for stock prices since they share some common attributes.

For those unfamiliar, the naive forecast is simply where you predict the last-known value.

Example: If today’s price on July 5 is $200 and I want to make a forecast with a 5-day horizon, then I will predict $200 for July 6, $200 for July 7, …, and $200 for July 10.
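The naive forecast takes only a couple of lines; here is a sketch (the price series is made up for illustration):

```python
import numpy as np

def naive_forecast(series, horizon):
    """Predict the last-known value for every step of the horizon."""
    return np.full(horizon, series[-1])

prices = np.array([195.0, 198.5, 201.2, 200.0])  # made-up closing prices
print(naive_forecast(prices, 5))  # five copies of the last price, 200.0
```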

I won’t bore you with the code (although it’s included in the course if you’re interested), but the answer is: Prophet does not beat the naive forecast.

In fact, it does not beat the naive forecast on any horizon I tried (5 days, 30 days, 60 days).

Sidenote: it’d be a good exercise to try 1 day as well.

Are stock prices really random walks? Although this particular example provides evidence supporting the random walk hypothesis, in my course, the GARCH section will provide strong evidence against it! Again, it’s all explained in my latest course, “Time Series Analysis, Forecasting, and Machine Learning“. Only the VIP version will contain the sections on Prophet, GARCH, and other important tools.

The VIP version is intended to be limited-time only, and the current coupon expires in less than one month!

Get your copy today while you still can.

# Why do you need math for machine learning and deep learning?

July 9, 2021

In this article, I will demonstrate why math is necessary for machine learning, data science, deep learning, and AI.

Most of my students have already heard this from me countless times. College-level math is a prerequisite for nearly all of my courses already.

Perhaps you believe I am biased, because I’m the one teaching these courses which require all this math.

It would seem that I am just some crazy guy, making things extra hard for you because I like making things difficult.

WRONG.

You’ve heard it from me many times. Now you’ll hear it from others.

## Example #1

Let’s begin with one of the most famous professors in ML, Daphne Koller, who co-founded Coursera.

In this clip, Lex Fridman asks what advice she would have for those interested in beginning a journey into AI and machine learning.

One important thing she mentions, which I have seen time and time again in my own experience, is that those without typical prerequisite math backgrounds often make mistakes and do things that don’t make sense.

She’s being nice here, but I’ve met many of these folks who not only have no idea that what they are doing doesn’t make sense, but are also overly confident about it!

Then it becomes a burden for me, because I have to put in extra effort explaining the basics just to convince them that they are wrong.

For that reason, I generally advise against hiring people for ML roles if they do not know basic math.

## Example #2

I enjoyed this strongly worded Reddit comment.

Original post:

Top comment:

## Example #3

Not exactly machine learning, but very related field: quant finance.

In fact, many students taking my courses dream about applying ML to finance.

Well, it’s going to be pretty hard if you can’t pass these interview questions.

http://www.math.kent.edu/~oana/math60070/InterviewProblems.pdf

Think about this logically: All quants who have a job can pass these kinds of interview questions. But you cannot. How well do you think you will do compared to them?

## Example #4

Entrepreneur and angel investor Naval Ravikant explains why deriving (what we do in all of my in-depth machine learning courses) is much more important than memorizing on the Joe Rogan Experience.

Most beginner-level Udemy courses don’t derive anything – they just tell you random facts about ML algorithms and then jump straight to the usual 3 lines of scikit-learn code. Useless!

# Time Series: How to convert AR(p) to VAR(1) and VAR(p) to VAR(1)

July 1, 2021

This is a very condensed post, mainly just so I could write down the equations I need for my Time Series Analysis course. 😉

However, if you find it useful, I am happy to hear that!

[Get 75% off the VIP version here]

Suppose we have an AR(2):

$$y_t = b + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t$$

Suppose we create a vector containing both $$y_t$$ and $$y_{t -1}$$:

$$\begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix}$$

We can write our AR(2) as follows:

$$\begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \end{bmatrix}$$

Exercise: expand the above to see that you get back the original AR(2). Note that the 2nd line just ends up giving you $$y_{t-1} = y_{t-1}$$.

The above is just a VAR(1)!

You can see this by letting:

$$\textbf{z}_t = \begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix}$$

$$\textbf{b}’ = \begin{bmatrix} b \\ 0 \end{bmatrix}$$

$$\boldsymbol{\Phi}’_1 = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix}$$

$$\boldsymbol{\eta}_t = \begin{bmatrix} \varepsilon_t \\ 0 \end{bmatrix}$$.

Then we get:

$$\textbf{z}_t = \textbf{b}’ + \boldsymbol{\Phi}’_1\textbf{z}_{t-1} + \boldsymbol{\eta}_t$$

Which is a VAR(1).

Now let us try to do the same thing with an AR(3).

$$y_t = b + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \varepsilon_t$$

We can write our AR(3) as follows:

$$\begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \end{bmatrix} = \begin{bmatrix} b \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \\ y_{t-3} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \\ 0 \end{bmatrix}$$

Note that this is also a VAR(1).

Of course, we can just repeat the same pattern for AR(p).
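The companion-matrix construction is easy to sanity-check numerically. This sketch (with made-up coefficients) iterates a noise-free AR(2) directly and via its VAR(1) form, and confirms that the first component of $$\textbf{z}_t$$ reproduces the scalar series:

```python
import numpy as np

b, phi1, phi2 = 0.5, 0.6, 0.3          # made-up AR(2) coefficients

A = np.array([[phi1, phi2],
              [1.0,  0.0]])            # companion matrix Phi'
b_vec = np.array([b, 0.0])             # intercept vector b'

# scalar AR(2) recursion (noise-free, for a deterministic check)
y = [1.0, 2.0]
for _ in range(10):
    y.append(b + phi1 * y[-1] + phi2 * y[-2])

# VAR(1) recursion on z_t = [y_t, y_{t-1}]
z = np.array([y[1], y[0]])
z_path = [z]
for _ in range(10):
    z = b_vec + A @ z
    z_path.append(z)

# the first component of z_t matches the scalar series exactly
assert np.allclose([zt[0] for zt in z_path], y[1:])
```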

The cool thing is, we can extend this to VAR(p) as well, to show that any VAR(p) can be expressed as a VAR(1).

Suppose we have a VAR(3).

$$\textbf{y}_t = \textbf{b} + \boldsymbol{\Phi}_1 \textbf{y}_{t-1} + \boldsymbol{\Phi}_2 \textbf{y}_{t-2} + \boldsymbol{\Phi}_3 \textbf{y}_{t-3} + \boldsymbol{ \varepsilon }_t$$

Now suppose that we create a new vector by concatenating $$\textbf{y}_t$$, $$\textbf{y}_{t-1}$$, and $$\textbf{y}_{t-2}$$. We get:

$$\begin{bmatrix} \textbf{y}_t \\ \textbf{y}_{t-1} \\ \textbf{y}_{t-2} \end{bmatrix} = \begin{bmatrix} \textbf{b} \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \boldsymbol{\Phi}_1 & \boldsymbol{\Phi}_2 & \boldsymbol{\Phi}_3 \\ I & 0 & 0 \\ 0 & I & 0 \end{bmatrix} \begin{bmatrix} \textbf{y}_{t-1} \\ \textbf{y}_{t-2} \\ \textbf{y}_{t-3} \end{bmatrix} + \begin{bmatrix} \boldsymbol{\varepsilon_t} \\ 0 \\ 0 \end{bmatrix}$$

This is a VAR(1)!

# Coding Interview Questions – Bioinformatics Rosalind.info – Finding a motif in DNA (Episode 18)

May 4, 2021

Hello all!

In this video, we’re going to do another coding interview question where I walk you through the process of:

– explaining the problem

– solving the problem

– analyzing the time complexity of the solution

These videos are designed to help you practice for future job interviews where you are required to answer technical whiteboard problems.

The problem we’ll solve can be viewed as a bioinformatics problem, but you don’t have to know anything about biology! Even those who only know basic coding should be able to solve this problem.

I hope this motivates you to realize that “bioinformatics” is not all that different from what you already know how to do – and therefore, contributing to this important and exciting field is totally within your reach.

# Reinforcement Learning Algorithms: Expected SARSA

April 6, 2021

In this post, we’ll extend our toolset for Reinforcement Learning by considering a new temporal difference (TD) method called Expected SARSA.

In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about SARSA and Q-Learning, two popular TD methods. We’ll see how Expected SARSA unifies the two.

## Review of SARSA and Q-Learning

Let’s begin by reviewing the regular TD methods covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state $$S_t$$ and decides what action $$A_t$$ to perform based on the state. This is called the policy and can be characterized by a probability distribution, $$\pi( A_t | S_t)$$.

As the agent does this action, it changes the environment which results in the next state $$S_{t+1}$$. A reward signal $$R_{t+1}$$ is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, $$G_t$$. The discounted return is defined as:

$$G_t \dot{=} R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$$

One important relationship we can infer from the definition of the return is:

$$G_t = R_{t + 1} + \gamma G_{t+1}$$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

Taking the expected value of both sides and conditioning on the state $$S_t$$, we arrive at the Bellman Equation:

$$V_\pi(s) = E_\pi[ R_{t + 1} + \gamma V_\pi(S_{t+1}) | S_t = s]$$

If we condition on both the state $$S_t$$ and the action $$A_t$$, we get the Bellman Equation for the action-value:

$$Q_\pi(s, a) = E_\pi \left[ R_{t + 1} + \gamma \sum_{a’} \pi(a’ |S_{t+1}) Q_\pi(S_{t+1}, a’) | S_t = s, A_t = a \right]$$

This will be the important relationship to consider when we learn about Expected SARSA.

Let’s go back a few steps.

The expected return given that the agent is in state $$S_t$$ and performs action $$A_t$$ at time $$t$$ is given by the Q-table. Specifically:

$$Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a]$$

The Q-table can be used to determine what the best action will be since we can just choose whichever action $$a$$ maximizes $$Q(s,a)$$.

The problem is, we do not know $$Q(s,a)$$! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution $$p(s’, r | s, a)$$.

Generally speaking, this distribution is unknown (e.g. imagine trying to write down the transition dynamics of a self-driving car).

The full Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a)$$

As you recall, it’s possible to convert the formula for the sample mean into a recursive update – something that looks a bit like gradient descent.

This is convenient so that we can update $$Q$$ each time we receive a new sample without having to sum up all $$N$$ values each time. If we did that, our computations would get longer and longer as $$N$$ grows!

The recursive update looks like this:

$$Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ G^{(N)}(s,a) - Q_\pi^{(N-1)}(s, a) \right]$$

This shows us how to get the $$N$$th estimate from the $$N-1$$th estimate. By setting $$\alpha = \frac{1}{N}$$, we get exactly the sample mean (this was derived in the course).
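A quick sketch verifying that claim: running the recursive update with $$\alpha = \frac{1}{N}$$ over some random samples lands exactly on the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)

q = 0.0
for n, g in enumerate(samples, start=1):
    alpha = 1.0 / n              # alpha = 1/N recovers the plain sample mean
    q = q + alpha * (g - q)      # constant-time update per new sample

assert np.isclose(q, samples.mean())
```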

The Temporal Difference approach uses “bootstrapping”, where we use existing values in $$Q_\pi(s, a)$$ to estimate the expected return.

The SARSA update looks as follows:

$$Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ r + \gamma Q_\pi^{(N-1)}(s', a') - Q_\pi^{(N-1)}(s, a) \right]$$

Essentially, we have replaced the old “target”:

$$G(s,a)$$

with a new “target”:

$$r + \gamma Q(s’, a’)$$

This should remind you of the right-hand side of the Bellman Equation for Q, with all the expected values removed.

What is the significance of this?

With Monte Carlo, we would have to wait until the entire episode is complete in order to compute the return (since the return is the sum of all future rewards).

For very long episodes, this means that it will take a long time before any updates are made to your agent.

For infinitely long episodes, it’s not possible to compute the return at all! (You’d have to wait an infinite amount of time to get an answer.)

TD learning uses the immediate reward $$r$$ and “estimates the rest” using the existing value of $$Q$$.

Therefore, your agent learns on each step, rather than waiting until the end of an episode.

The Q-Learning “target” is:

$$r + \gamma \max_{a'} Q(s', a')$$

The main difference between SARSA and Q-Learning is that SARSA is on-policy while Q-Learning is off-policy.

This is because Q-Learning always uses the max, even when that action wasn’t taken. SARSA uses action $$a’$$ – whichever action was actually chosen for the next step. Q-Learning learns the value of the policy in which you take the max (even though your behavior policy may be different – e.g. epsilon-greedy). SARSA learns the value of the behavior policy.

## Expected SARSA

For Expected SARSA, the target is:

$$r + \gamma \sum_{a’} \pi( a’ | s’) Q(s’, a’)$$

This can also be written as:

$$r + \gamma E_\pi [Q(s’, a’) | s’]$$

Thus, Expected SARSA uses the expected action-value for the next state $$s’$$, over all actions $$a’$$, weighted according to the policy distribution.

Like regular SARSA, this should remind you of the Bellman Equation for Q (even more so than regular SARSA since it now properly sums over the policy distribution).

You can think of regular SARSA as merely drawing samples from the Expected SARSA target distribution. Put a different way, SARSA does what Expected SARSA does, in expectation.

So how does Expected SARSA unify SARSA and Q-Learning?

In fact, Q-Learning is merely a special case of Expected SARSA.

First, recognize that Expected SARSA can be on-policy or off-policy. In the off-policy case, one could act according to some behavior policy $$b(a | s)$$, while the target is computed according to the target policy $$\pi( a | s)$$.

If $$b(a | s)$$ is epsilon-greedy while $$\pi(a | s)$$ is greedy, then Expected SARSA is equivalent to Q-Learning.

On the other hand, Expected SARSA generalizes SARSA as well, as discussed above.
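Here is a small numerical sketch of the targets side by side (the Q-values and reward are made up), confirming that with a greedy target policy the Expected SARSA target coincides with the Q-Learning target:

```python
import numpy as np

gamma, r = 0.9, 1.0
q_next = np.array([2.0, 5.0, 3.0])   # made-up Q(s', a') for each action a'

def epsilon_greedy_probs(q, eps):
    """Action probabilities under an epsilon-greedy policy."""
    p = np.full(len(q), eps / len(q))
    p[np.argmax(q)] += 1.0 - eps
    return p

# Expected SARSA target: r + gamma * sum_a' pi(a'|s') Q(s', a')
pi = epsilon_greedy_probs(q_next, eps=0.1)
expected_sarsa_target = r + gamma * pi @ q_next

# Q-Learning target: r + gamma * max_a' Q(s', a')
q_learning_target = r + gamma * q_next.max()

# with a greedy target policy (eps = 0), Expected SARSA reduces to Q-Learning
pi_greedy = epsilon_greedy_probs(q_next, eps=0.0)
assert np.isclose(r + gamma * pi_greedy @ q_next, q_learning_target)
```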

## Practical Considerations

1) Aside from all the theory above, the update itself is actually quite simple to implement. In some cases, you may find that it works better than SARSA or Q-Learning. Thus, it is a worthy addition to your RL toolbox and could form a part of your strategy for building agents to solve RL problems.

2) It’s slower than SARSA and Q-Learning since it requires summing over the action space on each update. So, you may have to decide whether the additional computation requirements are worth whatever increase in performance you observe.

# Monte Carlo with Importance Sampling for Reinforcement Learning

March 7, 2021

In this post, we’ll extend our toolset for Reinforcement Learning by considering the Monte Carlo method with importance sampling.

In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about the Monte Carlo method. But that’s just the beginning. There is still more that can be done to improve the agent’s learning capabilities.

## Review of Monte Carlo for Reinforcement Learning

Let’s begin by reviewing the regular Monte Carlo method covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state $$S_t$$ and decides what action $$A_t$$ to perform based on the state. This is called the policy and can be characterized by a probability distribution, $$\pi( A_t | S_t)$$.

As the agent does this action, it changes the environment which results in the next state $$S_{t+1}$$. A reward signal $$R_{t+1}$$ is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, $$G_t$$. The discounted return is defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

The expected return given that the agent is in state $$S_t$$ and performs action $$A_t$$ at time $$t$$ is given by the Q-table. Specifically:

$$Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a]$$

The Q-table can be used to determine what the best action will be since we can just choose whichever action $$a$$ maximizes $$Q(s,a)$$.

The problem is, we do not know $$Q(s,a)$$! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution $$p(s’, r | s, a)$$.

Generally speaking, this distribution is unknown (e.g. imagine trying to write down the transition dynamics of a self-driving car).

The Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a)$$

Where $$G^{(i)}(s,a)$$ was the sample return when the agent was in state $$s$$ and performed action $$a$$ during the $$i$$’th episode.

Put simply: play a bunch of episodes, collect all the state-action-reward sequences, calculate the returns (by summing up the rewards), and then compute the average return for each state-action pair.

Easy!

How can we ensure that we visit every state-action pair so that the whole Q-table is filled up with a sufficient number of samples?

Practically, we usually employ some kind of exploration strategy, such as epsilon-greedy. With epsilon-greedy, we perform the optimal action a fraction $$1-\varepsilon$$ of the time, and pick a random action a fraction $$\varepsilon$$ of the time. So $$\varepsilon$$ is the probability of exploration.

The problem with this approach is that it leads to a suboptimal policy. Why? It means that approximately $$\varepsilon$$ of the time, you are going to do something suboptimal!

(Note: it’s not exactly $$\varepsilon$$ since choosing an action randomly can still lead to choosing the optimal action by chance.)
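Epsilon-greedy itself is only a few lines; here is a sketch (the Q-values are made up):

```python
import numpy as np

def epsilon_greedy(q_row, eps, rng):
    """Greedy action with probability 1 - eps, uniform random with probability eps."""
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

rng = np.random.default_rng(0)
actions = [epsilon_greedy(np.array([0.1, 0.9, 0.3]), 0.1, rng) for _ in range(1000)]
# most picks are the greedy action (index 1); a small fraction are exploratory
```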

This is where we transition to the main topic of this article, which is how Importance Sampling can help us overcome this problem.

Monte Carlo with epsilon-greedy exploration is called an on-policy control method, because the action-value (Q-table) being estimated corresponds to the policy that the agent is following.

On the other hand, off-policy methods allow the agent to act according to one policy (called the behavior policy), while the action-value is computed for a different, eventually optimal policy (called the target policy).

Henceforth we will denote the target policy as $$\pi( a | s)$$ and the behavior policy as $$b(a | s)$$.

## Importance Sampling

Suppose that we would like to estimate the expected value of some function $$s$$ of a random variable $$X$$ under some distribution $$\pi$$. We write this as:

$$E_\pi[ s(X) ]$$

If we know $$\pi$$, then (assuming $$X$$ is a discrete random variable) this can be computed as:

$$E_\pi[ s(X) ] = \sum_x \pi(x)s(x)$$

If we don’t know $$\pi$$, but $$X$$ is sampled according to $$\pi$$, then this expectation can be estimated by using the sample mean:

$$E_\pi[ s(X) ] \approx \frac{1}{N}\sum_{i=1}^N s(X_i)$$

Now suppose that something prevents us from gathering samples according to the distribution $$\pi$$, but it’s still possible to gather samples from a different distribution $$b$$. i.e. We want $$X \sim \pi$$ but we have $$X \sim b$$ instead.

In this case, the above sample average does not give us the desired expectation. We would be estimating $$E_b[ s(X)]$$ instead of $$E_\pi[ s(X) ]$$.

The importance sampling solution is found by recognizing the following equalities:

$$\begin{eqnarray} && E_\pi \left[ s(X) \right] \\ &=& \sum_x \pi(x)s(x) \\ &=& \sum_x \pi(x)s(x)\frac{b(x)}{b(x)} \\ &=& \sum_x b(x)s(x)\frac{\pi(x)}{b(x)} \\ &=& E_b \left[ s(X)\frac{\pi(X)}{b(X)} \right] \end{eqnarray}$$

This tells us that it’s possible to estimate the expectation under $$\pi$$ even when the samples are drawn according to $$b$$. All we have to do is multiply by the importance sampling ratio, $$\frac{\pi(X)}{b(X)}$$.

The only requirement is that $$b(x)$$ is not 0 when $$\pi(x)$$ is not 0. This is called “coverage”.
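A quick numerical sketch of this identity, using made-up discrete distributions for $$\pi$$ and $$b$$ and $$s(x) = x^2$$:

```python
import numpy as np

rng = np.random.default_rng(42)

x_vals = np.array([0, 1, 2])
pi = np.array([0.1, 0.2, 0.7])    # target distribution
b = np.array([0.4, 0.3, 0.3])     # behavior distribution (b > 0 wherever pi > 0)

def s(x):
    return x ** 2

# sample from b, but estimate the expectation under pi via the ratio pi(x)/b(x)
X = rng.choice(x_vals, size=200_000, p=b)
estimate = np.mean((pi[X] / b[X]) * s(X))

exact = np.sum(pi * s(x_vals))    # the true E_pi[s(X)] = 3.0
assert abs(estimate - exact) < 0.05
```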

## Applying Importance Sampling to Reinforcement Learning

In reinforcement learning, the return $$G_t$$ is generated by acting according to the behavior policy $$b(a | s)$$ with transition dynamics $$p(s’, r | s, a)$$. But we would like to know the expectation of $$G_t$$ under the target policy $$\pi(a | s)$$ with the same transition dynamics.

In this case, the importance sampling ratio is a bit more complicated but can still be derived. $$G_t$$ is a sample from the distribution of $$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b \mspace{5mu} \forall \tau)$$.

Basically this says: “the distribution of all the actions and states that happened after arriving in state $$S_t$$, following the policy $$b$$”.

The distribution we want the expectation with respect to is the same thing, but with actions drawn according to $$\pi$$. i.e. $$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi \mspace{5mu} \forall \tau)$$.

Thanks to the Markov property these distributions are easy to expand.

$$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b) = \prod_{\tau=t}^{T-1} b(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

$$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi) = \prod_{\tau=t}^{T-1} \pi(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

The importance sampling ratio is then just:

$$\frac{p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi)}{p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b)} = \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

The transition dynamics cancel out because they are the same on both top and bottom.

This is convenient, because we know $$\pi$$ and we know $$b$$, but we do not know $$p$$ (which is why we have to use Monte Carlo in the first place).

Let’s define this importance sampling ratio using the symbol $$\rho$$.

$$\rho_{t:T-1} \dot{=} \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

Using this importance sampling ratio, we can estimate $$Q_\pi(s,a)$$ even though we are acting according to a different policy $$b$$ and using the returns generated from this other policy.

$$Q_\pi(s,a) \approx \frac{1}{N}\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a)$$

Where $$G^{(i)}(s,a)$$ was the sample return when the agent was in state $$s$$ and performed action $$a$$ during the $$i$$’th episode, and $$\rho^{(i)}(s,a)$$ was the corresponding importance sampling ratio.

## Importance Sampling for Monte Carlo Implementation

At this point, you know all the theory. All that you have to do now is plug in the above importance sampling ratio in the appropriate places in your existing Monte Carlo code, and you’ll be doing Monte Carlo with importance sampling.

Here are some important considerations.

Like the return $$G_t$$, the importance sampling ratio is defined in terms of future values. i.e. the importance sampling ratio at time $$t$$ depends on the probabilities of the behavior and target policies at time $$t+1, t+2, …$$.

Therefore, it would make sense to loop backwards in time to compute this ratio, just like we loop backwards in time to compute the return.

Just like the return, the importance sampling ratio can be computed recursively.
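Here is a sketch of that backward pass over one made-up episode, computing the return and the importance sampling ratio together:

```python
gamma = 0.9

# one made-up episode; each step stores (pi(a|s), b(a|s), reward received after)
episode = [
    (0.9, 0.5, 0.0),
    (0.8, 0.5, 0.0),
    (1.0, 0.5, 1.0),
]

G, rho = 0.0, 1.0
returns, ratios = [], []
for pi_a, b_a, r in reversed(episode):
    G = r + gamma * G              # G_t = R_{t+1} + gamma * G_{t+1}
    rho = rho * (pi_a / b_a)       # rho_{t:T-1} = (pi_t / b_t) * rho_{t+1:T-1}
    returns.append(G)
    ratios.append(rho)
returns.reverse()
ratios.reverse()
# returns[0] and ratios[0] now correspond to the start of the episode
```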

Finally, you’ll recall that for the regular unweighted sample mean, it’s possible to perform constant-time updates every time we collect a new sample, instead of naively summing up all the samples and dividing by N.

$$Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) + \frac{1}{i}(G^{(i)}(s,a) - Q^{(i-1)}(s,a))$$

Similarly, it’s possible to express the weighted sample mean update using a similar constant-time operation. i.e.

$$Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) + \alpha^{(i)}(G^{(i)}(s,a) - Q^{(i-1)}(s,a))$$

As an exercise, try to derive what $$\alpha^{(i)}$$ should be.

Last point: I haven’t discussed weighted importance sampling, which can be used to reduce the variance of the estimate. The weighted importance sampling estimate looks like this:

$$Q_\pi(s,a) \approx \frac{\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a)}{ \sum_{i=1}^N \rho^{(i)}(s,a) }$$

## Recap

Let’s review why this is different from regular Monte Carlo.

Regular Monte Carlo (what I covered in my Reinforcement Learning course) is an on-policy control method.

We use epsilon-greedy because exploration is required to collect enough samples for all state-action pairs. Epsilon-greedy is suboptimal by definition. Our Q-table and final policy will thus be suboptimal.

What might be better is an off-policy control method, where we act according to a behavior policy which allows for exploration, but compute the Q-table according to the target greedy policy (the optimal policy).


# LeetCode in Machine Learning and Data Science (Episode 17)

January 15, 2021

Just the other day, I came across a forum post discussing why job interviews for ML engineers and data scientists often involve algorithms and coding. “I don’t want to be a software engineer!” you say.

Except, this is reality. And if you want a job, you have to face this reality.

In this free video, I will show you many of the insightful comments posted in the discussion, and we are going to complete an example LeetCode interview question from beginning to end. I’m also going to show you a typical answer given by someone who doesn’t know algorithms and why that answer is wrong.

October 26, 2020

Hello all!

Today, I’ve got 2 exciting things for you.

First, now is your chance to VOTE to tell me what you want to see in my next course. Transformers? Time Series Analysis? More advanced GANs? More advanced Reinforcement Learning? Let me know in this survey (it’s anonymous):

https://forms.gle/CNjjdaffSpi67qvq9

Second, check out the latest episode of The Lazy Programmer Show, where I discuss why bad programmers are always trying to get the latest version of some language or library, and why they tend to freak out when things are not in whatever version they happen to be using.

I look at this topic from 3 different perspectives, including:

1) What is it like in the “real world” working at a “real job”?

2) What kind of skills should a typical (competent) programmer have?

3) What are students learning and how do they approach machine learning and coding in colleges? Remember those students become “new grads” and those “new grads” become “junior engineers”.

What would your boss say if a junior engineer could run circles around you, a so-called “professional”?

# Tensorflow 2 One Year Later: What do I think now? (+PyTorch, JAX, Julia)

September 24, 2020

In the latest episode of the Lazy Programmer Show, I give you my honest opinion of Tensorflow 2, one year after creating the leading Tensorflow 2 course on Udemy.

Is it still good?

Is it still worth learning?

Why does stuff keep breaking / changing?

How does it compare to other offerings? (PyTorch, JAX, Julia)

Did you know JAX was created by Google? (thus, using Tensorflow doesn’t equate to “doing what Google’s doing”)

Julia is a totally different programming language popular with many data scientists and machine learning engineers. Will it replace Python?

PLUS, a little bonus (but you’ll have to watch the video to see what it is) 😉

Check out the video here: https://youtu.be/A9lvfm3k6m4

Or just watch it here:

# How to Build Your Own Computer Science Degree

August 29, 2020

Note: You can find the video lecture for this article at https://youtu.be/C-RZUWOBDpY

Summary of the video:

The following books can be used to study core computer science topics at the college / university level, to prepare yourself for machine learning, deep learning, artificial intelligence, and data science.

These are the books I recommend for building your own computer science degree. Remember! The goal is to do as many exercises as you can. It’s not to just watch 5-minute YouTube videos and then conclude “I understand everything! There’s no need for exercises!”

The purpose of actually learning this material is so you can “do” useful things.

It’s not about memorizing a set of rules about calculus, a set of rules about probability, etc.

Sidenote: if you prefer video courses over books, then check out the links at the bottom of this page (under “other resources”):

Deep Learning Courses

However, note that these courses do not come with sufficient exercises. For those, I still strongly recommend attempting the ones from these books, which should test you at the appropriate level.

These books cover common core courses that are relevant for many sub-fields of Computer Science and Engineering, including Machine Learning, but also related fields such as operations research, statistics, quantitative finance, software engineering, digital communications, wireless communications, control systems (e.g. autopilot), robotics, and many more.

To recap, these are the courses and why you want to take them:

Calculus

Nearly all machine learning algorithms boil down to optimization problems. What is optimization? Generally speaking, it’s when you have a function and you want to maximize or minimize that function.

If you’ve taken calculus, then you should recall that this is exactly what you learn how to do in calculus.

Therefore, calculus is an essential tool in artificial intelligence, data science, etc.
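As a minimal sketch of that connection (the function and step size here are made up purely for illustration): gradient descent minimizes a function by repeatedly stepping against its derivative, which is exactly what calculus teaches you to compute.

```python
# Minimizing f(x) = (x - 3)^2 with gradient descent.
# Calculus gives the derivative f'(x) = 2(x - 3); stepping against it
# moves x toward the minimum at x = 3.

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to minimize a function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # converges very close to 3.0
```

This is, at its core, the same procedure that trains neural networks, just in one dimension instead of millions.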

Linear Algebra

In machine learning and deep learning especially, we work with vectors, matrices, and higher-dimensional objects. This is the realm of linear algebra.

Luckily, you don’t have to go that far in linear algebra to get what you need for deep learning and AI.

For example, concepts such as spans, subspaces, and rank rarely show up in machine learning.

On the other hand, the basics, such as matrix and vector multiplication and eigenvalues and eigenvectors, show up often.
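A few lines of NumPy (used here because it is the standard Python tool for this; the matrix is made up for illustration) cover both of those basics:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
v = np.array([1.0, 1.0])

# Matrix-vector multiplication: how a neural-network layer transforms its input.
print(A @ v)  # [2. 3.]

# Eigenvalues/eigenvectors: A stretches its eigenvectors without rotating them.
eigvals, eigvecs = np.linalg.eig(A)
print(np.sort(eigvals))  # [2. 3.]
```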

Probability

Probability is the language you must speak if you want to do machine learning and AI.

Recall above that machine learning often boils down to an optimization. What are we trying to optimize? Often, it’s an expected value. What is an expected value? Well, you have to learn probability to find that out.
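As a minimal illustration (my own toy example, not from any course): the expected value of a fair six-sided die is just the probability-weighted average of its faces.

```python
# Expected value of a fair die: sum of (outcome * probability) over outcomes.
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6  # each face is equally likely

expected_value = sum(x * p for x in outcomes)
print(expected_value)  # 3.5
```

Replace the die with a loss function over random data, and this same weighted average is what machine learning algorithms try to minimize.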

For a time, “probabilistic graphical models” were the state of the art in artificial intelligence. Clearly, probability would be a prerequisite.

Probability shows up nearly everywhere in machine learning, from soft k-means clustering to hidden Markov models to artificial neural networks and beyond.

Side note: if you were thinking earlier, “who needs calculus when Tensorflow can do it for me!?”, think again. Calculus is a prerequisite to probability. So if you want to learn probability, you still need calculus anyway.

Programming

Obviously, at some point, you need to be able to actually write a computer program in order to use machine learning.

Nowadays, things can seem very simple when all you need is 3 lines of boilerplate code to use scikit-learn.

However, that’s not really what you should imagine when you think of “learning machine learning”.

Check any college-level machine learning course (not that weak sauce being sold by marketers online) to confirm what I am saying.

As the great physicist Richard Feynman once said, “What I cannot create, I do not understand”.

In order to best understand a machine learning algorithm, you should implement it.

No, you are not “reinventing the wheel”. This is called “learning”.

I would posit that if you can’t implement basic algorithms like k-means clustering, logistic regression, k-nearest neighbor, and naive Bayes, you do not understand those algorithms.
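As a sketch of what such a from-scratch implementation looks like (this is my own bare-bones version, not code from any particular book), here is k-means clustering: alternate between assigning points to the nearest centroid and moving each centroid to the mean of its points.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Bare-bones k-means: assign points to nearest centroid, then update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated blobs: one centroid should land near each blob.
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(10, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(np.sort(centroids[:, 0]))
```

Twenty lines, and now you actually know what `sklearn.cluster.KMeans` is doing under the hood.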

So why do I suggest Java over something like Python, which has easily become the most popular language for doing data science?

The problem with Python is that it’s too high level. It doesn’t make you think about the program you are writing at the algorithmic level.

You should understand the difference between an efficient and an inefficient algorithm. (No, that doesn’t mean memorizing facts like “Python list comprehensions are better than for loops”).

In fact, you should recognize that a list comprehension has the exact same time complexity as the equivalent for loop.

Java, being slightly lower level, forces you to think algorithmically.

That brings us to the final topic.

Algorithms and Data Structures

In order to really understand algorithms, you should study… algorithms.

There are many famous algorithms contained in the book I’ve suggested below.

Realistically, you are not going to use these in your day to day work (a very common complaint from software developers seeking employment).

However, that’s not really the point.

The point is exercising your brain and learning how to think in certain ways that help you write better code.

Also, you should understand the pros and cons of basic data structures such as lists, sets, dictionaries, and trees.

If you’re coding up some algorithm, why might a set be better than a list? Algorithms tell you why.
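A concrete instance of that question is membership testing (my own toy benchmark; exact timings vary by machine): a list scans its elements one by one, O(n), while a set hashes the query, O(1) on average.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)

# Worst case for the list: the element we look up is at the very end.
t_list = timeit.timeit(lambda: (n - 1) in as_list, number=100)
t_set = timeit.timeit(lambda: (n - 1) in as_set, number=100)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")  # the set is far faster
```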

One major data structure you might want to learn about is graphs (along with their associated algorithms). Graph neural networks seem to be picking up steam, and they are being used for all kinds of interesting problems like social network analysis, chemistry, and more.

To summarize: the core courses I would consider essential for building your own Computer Science degree in preparation for machine learning, deep learning, data science, and artificial intelligence are calculus, linear algebra, probability, programming, and algorithms.

Don’t just watch a bunch of videos on YouTube or Khan Academy and then proclaim you understand the subject. The reason I’ve suggested books is because they contain exercises / homework problems. These are what you must be able to do in order to claim that you understand something. It’s not about “absorbing information”, it’s about “producing useful output”.

Common question: What about C++? Yes, C++ is excellent! Ideally, you will learn both C++ and Java, but obviously, these are not hard prerequisites for machine learning or data science.

### Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein

Disclaimer: this post contains Amazon affiliate links.