# NEW COURSE: Financial Engineering and Artificial Intelligence in Python

September 8, 2020

# VIP Promotion

### The complete Financial Engineering course has arrived

Hello once again friends!

Today, I am announcing the VIP version of my latest course: Financial Engineering and Artificial Intelligence in Python.

https://www.udemy.com/course/ai-finance/?couponCode=FINANCEVIP (expires Oct 9, 2020)

https://www.udemy.com/course/ai-finance/?couponCode=FINANCEVIP9 (expires June 16, 2021)

(as usual, this coupon lasts only 30 days, so don’t wait!)

This is a MASSIVE (20 hours) Financial Engineering course covering the core fundamentals of financial engineering and financial analysis from scratch. We will go in-depth into all the classic topics, such as:

• Exploratory data analysis, significance testing, correlations
• Alpha and beta
• Advanced Pandas Data Frame manipulation for time series and finance
• Time series analysis, simple moving average, exponentially-weighted moving average
• Holt-Winters exponential smoothing model
• ARIMA and SARIMA
• Efficient Market Hypothesis
• Random Walk Hypothesis
• Time series forecasting (“stock price prediction”)
• Modern portfolio theory
• Efficient frontier / Markowitz bullet
• Mean-variance optimization
• Maximizing the Sharpe ratio
• Convex optimization with Linear Programming and Quadratic Programming
• Capital Asset Pricing Model (CAPM)

In addition, we will look at various non-traditional techniques which stem purely from the field of machine learning and artificial intelligence, such as:

• Regression models
• Classification models
• Unsupervised learning
• Reinforcement learning and Q-learning

We will learn about the greatest flub made in the past decade by marketers posing as “machine learning experts” who promise to teach unsuspecting students how to “predict stock prices with LSTMs”. You will learn exactly why their methodology is fundamentally flawed and why their results are complete nonsense. It is a lesson in how not to apply AI in finance.

### List of VIP-only Contents

As with my Tensorflow 2 release, some of the VIP content will be a surprise and will be released in stages. Currently, the entirety of the Algorithmic Trading sections are VIP sections. Newly added VIP sections include Statistical Factor Models and “The Lazy Programmer Bonus Offer”. Here’s a full list:

Classic Algorithmic Trading – Trend Following Strategy

You will learn how moving averages can be applied to do algorithmic trading.

Forecast returns in order to determine when to buy and sell.

I give you a full introduction to Reinforcement Learning from scratch, and then we apply it to build a Q-Learning trader. Note that this is *not* the same as the example I used in my Tensorflow 2, PyTorch, and Reinforcement Learning courses. I think the example included in this course is much more principled and robust.

Statistical Factor Models

The CAPM is one of the most renowned financial models in history, but did you know it’s only the simplest factor model, with just a single factor? To go beyond just this single factor model, we will learn about statistical factor models, where the multiple “factors” are found automatically using only the data.

The Lazy Programmer Bonus Offer

There are marketers out there who want to capitalize on your enthusiastic interest in finance, and unfortunately what they are teaching you is utter and complete garbage.

They will claim that they can “predict stock prices with LSTMs” and show you charts like this with nearly perfect stock price predictions.

Hint: if they can do this, why do they bother putting effort into making courses? Wouldn’t they already be billionaires?

Have you ever wondered if you are taking such a course from a fake data scientist / marketer? If so, just send me a message, and I will tell you whether or not you are taking such a course. (Hint: many of you are) I will give you a list of mistakes they made so you can look out for them yourself, and avoid “learning” things which will ultimately make YOU look very bad in front of potential future employers.

Believe me, if you ever try to get a job in machine learning or data science and you talk about a project where you “predicted stock prices with LSTMs”, all you will be demonstrating is how incompetent you are.

Save yourself from this embarrassing scenario by taking the “Lazy Programmer Offer”!

Please note: The VIP coupon will work only for the next month (starting from the coupon creation time). It’s unknown whether the VIP period will renew after that time.

After that, although the VIP content will be removed from Udemy, all who purchased the VIP course will get permanent free access to these VIP contents on deeplearningcourses.com.

In case it’s not clear, the process is very easy. For those folks who need the “step-by-step” instructions…:

STEP 1) I announce the VIP content will be removed.

STEP 2) You email me with proof that you purchased the course during the VIP period. Do NOT email me earlier as it will just get buried.

STEP 3) I will give you free access to the VIP materials for this course on deeplearningcourses.com.

### Benefits of taking this course

• Learn the knowledge you need to work at top tier investment firms
• Gain practical, real-world quantitative skills that can be applied within and outside of finance
• Make better decisions regarding your own finances

Personally, I think this is the most interesting and action-packed course I have created yet. My last few courses were cool, but they were all about topics which I had already covered in the past! GANs, NLP, Transfer Learning, Recommender Systems, etc etc. all just machine learning topics I have covered several times in different libraries. This course contains new, fresh content and concepts I have never covered in any of my courses, ever.

This is the first course I’ve created that extends into a niche area of AI application. It goes outside of AI and into domain expertise. An in-depth topic such as finance deserves its own course. This is that course. These are topics you will never learn in a generic data science or machine learning course. However, as a student of AI, you will recognize many of our tools and methods being applied, such as statistical inference, supervised and unsupervised learning, convex optimization, and optimal control. This allows us to go deeper than your run of the mill financial engineering course, and it becomes more than just the sum of its parts.

So what are you waiting for?

April 1, 2020

# VIP Promotion

### The complete PyTorch course has arrived

Hello friends!

I hope you are all staying safe. Well, I’m sure you’ve heard enough about that so how about some different news?

Today, I am announcing the VIP version of my latest course: PyTorch: Deep Learning and Artificial Intelligence

https://www.udemy.com/course/pytorch-deep-learning/?couponCode=PYTORCHVIP14 (expires June 16, 2021)

This is a MASSIVE (over 22 hours) Deep Learning course covering EVERYTHING from scratch. That includes:

• Machine learning basics (linear neurons)
• ANNs, CNNs, and RNNs for images and sequence data
• Time series forecasting and stock predictions (+ why all those fake data scientists are doing it wrong)
• NLP (natural language processing)
• Recommender systems
• Transfer learning for computer vision
• Deep reinforcement learning and applying it by building a stock trading bot

IN ADDITION, you will get some unique and never-before-seen VIP projects:

Estimating prediction uncertainty

Drawing the standard deviation of the prediction along with the prediction itself. This is useful for heteroskedastic data (that means the variance changes as a function of the input). The most popular application where heteroskedasticity appears is stock prices and stock returns – which I know a lot of you are interested in.

It allows you to draw your model predictions like this:

Sometimes, the data is simply such that a spot-on prediction can’t be made. But we can do better by letting the model tell us how certain it is in its predictions.

Facial recognition with siamese networks

This one is cool. I mean, I don’t have to tell you how big facial recognition has become, right? It’s the single most controversial technology to come out of deep learning. In the past, we looked at simple ways of doing this with classification, but in this section I will teach you about an architecture built specifically for facial recognition.

You will learn how this can work even on small datasets – so you can build a network that recognizes your friends or can even identify all of your coworkers!

You can really impress your boss with this one. Surprise them one day with an app that calls out your coworkers by name every time they walk by your desk. 😉

Please note: The VIP coupon will work only for the next month (ending May 1, 2020). It’s unknown whether the VIP period will renew after that time.

After that, although the VIP content will be removed from Udemy, all who purchased the VIP course will get permanent free access on deeplearningcourses.com.

## Minimal Prerequisites

This course is designed to be a beginner to advanced course. All that is required is that you take my free Numpy prerequisites to learn some basic scientific programming in Python. And it’s free, so why wouldn’t you!?

You will learn things that took me years to learn on my own. For many people, that is worth tens of thousands of dollars by itself.

There is no heavy math, no backpropagation, etc. Why? Because I already have courses on those things. So there’s no need to repeat them here, and PyTorch doesn’t use them. So you can relax and have fun. =)

## Why PyTorch?

All of my deep learning courses until now have been in Tensorflow (and prior to that Theano).

So why learn PyTorch?

Does this mean my future deep learning courses will use PyTorch?

In fact, if you have traveled in machine learning circles recently, you will have noticed that there has been a strong shift to PyTorch.

Case in point: OpenAI switched to PyTorch earlier this year (2020).

Major AI shops such as Apple, JPMorgan Chase, and Qualcomm have adopted PyTorch.

PyTorch is primarily maintained by Facebook (Facebook AI Research to be specific) – the “other” Internet giant who, alongside Google, have a strong vested interest in developing state-of-the-art AI.

But why PyTorch for you and me? (aside from the fact that you might want to work for one of the above companies)

As you know, Tensorflow has adopted the super simple Keras API. This makes common things easy, but it makes uncommon things hard.

With PyTorch, common things take a tiny bit of extra effort, but the upside is that uncommon things are still very easy.

Creating your own custom models and inventing your own ideas is seamless. We will see many examples of that in this course.

For this reason, it is very possible that future deep learning courses will use PyTorch, especially for those advanced topics that many of you have been asking for.

Because of the ease at which you can do advanced things, PyTorch is the main library used by deep learning researchers around the world. If that’s your goal, then PyTorch is for you.

In terms of growth rate, PyTorch dominates Tensorflow. PyTorch now outnumbers Tensorflow by 2:1 and even 3:1 at major machine learning conferences. Researchers hold that PyTorch is superior to Tensorflow in terms of the simplicity of its API, and even speed / performance!

Do you need more convincing?

May 18, 2021

# VIP Promotion

Hello all!

In this post, I am announcing the VIP coupon to my course titled “Artificial Intelligence: Reinforcement Learning in Python”.

There are 2 places to get the course.

1. Udemy, with this VIP coupon: https://www.udemy.com/course/artificial-intelligence-reinforcement-learning-in-python/?couponCode=REINFORCEMENTVIP (expires June 16, 2021)
2. Deep Learning Courses (coupon automatically applied): https://deeplearningcourses.com/c/artificial-intelligence-reinforcement-learning-in-python

You may recognize this course as one that has already existed in my catalog – however, the course I am announcing today contains ALL-NEW material. The entire course has been gutted and every lecture contained within the course did not exist in the original version.

One of the most common questions I get from students in my PyTorch, Tensorflow 2, and Financial Engineering courses is: “How can I learn reinforcement learning?”

While I do cover RL in those courses, it’s very brief. I’ve essentially summarized 12 hours of material into 2. So by necessity, you will be missing some things.

While that serves as a good way to scratch the surface of RL, it doesn’t give you a true, in-depth understanding that you will get by actually learning each component of RL step-by-step, and most importantly, getting a chance to put everything into code!

This course covers:

• The explore-exploit dilemma and the Bayesian bandit method
• MDPs (Markov Decision Processes)
• Dynamic Programming solution for MDPs
• Monte Carlo Method
• Temporal Difference Method (including Q-Learning)
• Approximation Methods using Radial Basis Functions
• Applying your code to OpenAI Gym with zero effort / code changes
• Building a stock trading bot (different approach in each course!)

When you get the DeepLearningCourses.com version, note that you will get both versions (new and old) of the course – totalling nearly 20 hours of material.

If you want access to the tic-tac-toe project, this is the version you should get.

Otherwise, if you prefer to use Udemy, that’s fine too. If you purchase on Udemy but would like access to DeepLearningCourses.com, I will allow this since they are the same price. Just send me an email and show me your proof of purchase.

Note that I’m not able to offer the reverse (can’t give you access to Udemy if you purchase on DeepLerningCourses.com, due to operational reasons).

So what are you waiting for?

# Coding Interview Questions – Bioinformatics Rosalind.info – Finding a motif in DNA (Episode 18)

May 4, 2021

Hello all!

In this video, we’re going to do another coding interview question where I walk you through the process of:

– explaining the problem

– solving the problem

– analyzing the time complexity of the solution

These videos are designed to help you practice for future job interviews where you are required to answer technical whiteboard problems.

The problem we’ll solve can be viewed as a bioinformatics problem, but you don’t have to know anything about biology! Even those who only know basic coding should be able to solve this problem.

I hope this motivates you to realize that “bioinformatics” is not all that different from what you already know how to do – and therefore, contributing to this important and exciting field is totally within your reach.

# Reinforcement Learning Algorithms: Expected SARSA

April 6, 2021

In this post, we’ll extend our toolset for Reinforcement Learning by considering a new temporal difference (TD) method called Expected SARSA.

In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about SARSA and Q-Learning, two popular TD methods. We’ll see how Expected SARSA unifies the two.

## Review of SARSA and Q-Learning

Let’s begin by reviewing the regular TD methods covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state $$S_t$$ and decides what action $$A_t$$ to perform based on the state. This is called the policy and can be characterized by a probability distribution, $$\pi( A_t | S_t)$$.

As the agent does this action, it changes the environment which results in the next state $$S_{t+1}$$. A reward signal $$R_{t+1}$$ is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, $$G_t$$. The discounted return is defined as:

$$G_t \dot{=} R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + … + \gamma^{T – t – 1} R_T$$

One important relationship we can infer from the definition of the return is:

$$G_t = R_{t + 1} + \gamma G_{t+1}$$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

Taking the expected value of both sides and conditioning on the state $$S_t$$, we arrive at the Bellman Equation:

$$V_\pi(s) = E_\pi[ R_{t + 1} + \gamma V_\pi(S_{t+1}) | S_t = s]$$

If we condition on both the state $$S_t$$ and the action $$A_t$$, we get the Bellman Equation for the action-value:

$$Q_\pi(s, a) = E_\pi \left[ R_{t + 1} + \gamma \sum_{a’} \pi(a’ |S_{t+1}) Q_\pi(S_{t+1}, a’) | S_t = s, A_t = a \right]$$

This will be the important relationship to consider when we learn about Expected SARSA.

Let’s go back a few steps.

The expected return given that the agent is in state $$S_t$$ and performs action $$A_t$$ at time $$t$$ is given by the Q-table. Specifically:

$$Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a]$$

The Q-table can be used to determine what the best action will be since we can just choose whichever action $$a$$ maximizes $$Q(s,a)$$.

The problem is, we do not know $$Q(s,a)$$! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution $$p(s’, r | s, a)$$.

Generally speaking, this is unknown. e.g. Imagine building a self-driving car.

The full Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a)$$

As you recall, it’s possible to convert the formula for the sample mean into a recursive update – something that looks a bit like gradient descent.

This is convenient so that we can update $$Q$$ each time we receive a new sample without having to sum up all $$N$$ values each time. If we did that, our computations would get longer and longer as $$N$$ grows!

The recursive update looks like this:

$$Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ G^{(N)}(s,a) – Q_\pi^{(N-1)}(s, a) \right]$$

This shows us how to get the $$N$$th estimate from the $$N-1$$th estimate. By setting $$\alpha = \frac{1}{N}$$, we get exactly the sample mean (this was derived in the course).

The Temporal Difference approach uses “bootstrapping”, where we use existing values in $$Q_\pi(s, a)$$ to estimate the expected return.

The SARSA update looks as follows:

$$Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ r + \gamma Q_\pi^{(N-1)}(s’, a’) – Q_\pi^{(N-1)}(s, a) \right]$$

Essentially, we have replaced the old “target”:

$$G(s,a)$$

with a new “target”:

$$r + \gamma Q(s’, a’)$$

This should remind you of the right-hand side of the Bellman Equation for Q, with all the expected values removed.

What is the significance of this?

With Monte Carlo, we would have to wait until the entire episode is complete in order to compute the return (since the return is the sum of all future rewards).

For very long episodes, this means that it will take a long time before any updates are made to your agent.

For infinitely long episodes, it’s not possible to compute the return at all! (You’d have to wait an infinite amount of time to get an answer.)

TD learning uses the immediate reward $$r$$ and “estimates the rest” using the existing value of $$Q$$.

Therefore, your agent learns on each step, rather than waiting until the end of an episode.

The Q-Learning “target” is:

$$r + \gamma \max_a’ Q(s’, a’)$$

The main difference between SARSA and Q-Learning is that SARSA is on-policy while Q-Learning is off-policy.

This is because Q-Learning always uses the max, even when that action wasn’t taken. SARSA uses action $$a’$$ – whichever action was actually chosen for the next step. Q-Learning learns the value of the policy in which you take the max (even though your behavior policy may be different – e.g. epsilon-greedy). SARSA learns the value of the behavior policy.

## Expected SARSA

For Expected SARSA, the target is:

$$r + \gamma \sum_{a’} \pi( a’ | s’) Q(s’, a’)$$

This can also be written as:

$$r + \gamma E_\pi [Q(s’, a’) | s’]$$

Thus, Expected SARSA uses the expected action-value for the next state $$s’$$, over all actions $$a’$$, weighted according to the policy distribution.

Like regular SARSA, this should remind you of the Bellman Equation for Q (even more so than regular SARSA since it now properly sums over the policy distribution).

You can think of regular SARSA as merely drawing samples from the Expected SARSA target distribution. Put a different way, SARSA does what Expected SARSA does, in expectation.

So how does Expected SARSA unify SARSA and Q-Learning?

In fact, Q-Learning is merely a special case of Expected SARSA.

First, recognize that Expected SARSA can be on-policy or off-policy. In the off-policy case, one could act according to some behavior policy $$b(a | s)$$, while the target is computed according to the target policy $$\pi( a | s)$$.

If $$b(a | s)$$ is epsilon-greedy while $$\pi(a | s)$$ is greedy, then Expected SARSA is equivalent to Q-Learning.

On the other hand, Expected SARSA generalizes SARSA as well, as discussed above.

## Practical Considerations

1) Aside from all the theory above, the update itself is actually quite simple to implement. In some cases, you may find that it works better than SARSA or Q-Learning. Thus, it is a worthy addition to your RL toolbox and could form a part of your strategy for building agents to solve RL problems.

2) It’s slower than SARSA and Q-Learning since it requires summing over the action space on each update. So, you may have to decide whether the additional computation requirements are worth whatever increase in performance you observe.

# Monte Carlo with Importance Sampling for Reinforcement Learning

March 7, 2021

In this post, we’ll extend our toolset for Reinforcement Learning by considering the Monte Carlo method with importance sampling.

In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about the Monte Carlo method. But that’s just the beginning. There is still more that can be done to improve the agent’s learning capabilities.

## Review of Monte Carlo for Reinforcement Learning

Let’s begin by reviewing the regular Monte Carlo method covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state $$S_t$$ and decides what action $$A_t$$ to perform based on the state. This is called the policy and can be characterized by a probability distribution, $$\pi( A_t | S_t)$$.

As the agent does this action, it changes the environment which results in the next state $$S_{t+1}$$. A reward signal $$R_{t+1}$$ is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, $$G_t$$. The discounted return is defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + … + \gamma^{T – t – 1} R_T$$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

The expected return given that the agent is in state $$S_t$$ and performs action $$A_t$$ at time $$t$$ is given by the Q-table. Specifically:

$$Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a]$$

The Q-table can be used to determine what the best action will be since we can just choose whichever action $$a$$ maximizes $$Q(s,a)$$.

The problem is, we do not know $$Q(s,a)$$! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution $$p(s’, r | s, a)$$.

Generally speaking, this is unknown. e.g. Imagine building a self-driving car.

The Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a)$$

Where $$G^{(i)}(s,a)$$ was the sample return when the agent was in state $$s$$ and performed action $$a$$ during the $$i$$’th episode.

Put simply: play a bunch of episodes, collect all the state-action-reward sequences, calculate the returns (by summing up the rewards), and then compute the average return for each state-action pair.

Easy!

How can we ensure that we visit every state-action pair so that the whole Q-table is filled up with a sufficient number of samples?

Practically, we usually employ some kind of exploration strategy, such as epsilon-greedy. With epsilon-greedy, we perform the optimal action $$1-\varepsilon$$ of the time, and we pick a random action $$\varepsilon$$ of the time. So $$\varepsilon$$ is the probability of exploration.

The problem with this approach is that it leads to a suboptimal policy. Why? It means that approximately $$\varepsilon$$ of the time, you are going to do something suboptimal!

(Note: it’s not exactly $$\varepsilon$$ since choosing an action randomly can still lead to choosing the optimal action by chance.)

This is where we transition to the main topic of this article, which is how Importance Sampling can help us overcome this problem.

Monte Carlo with epsilon-greedy exploration is called an on-policy control method, because the action-value (Q-table) being estimated corresponds to the policy that the agent is following.

On the other hand, off-policy methods allow the agent to act according to one policy (called the behavior policy), while the action-value is computed for a different, eventually optimal policy (called the target policy).

Henceforth we will denote the target policy as $$\pi( a | s)$$ and the behavior policy as $$b(a | s)$$.

## Importance Sampling

Suppose that we would like to estimate the expected value of some function $$s$$ of a random variable $$X$$ under some distribution $$\pi$$. We write this as:

$$E_\pi[ s(X) ]$$

If we know $$\pi$$, then (assuming $$X$$ is a discrete random variable) this can be computed as:

$$E_\pi[ s(X) ] = \sum_x \pi(x)s(x)$$

If we don’t know $$\pi$$, but $$X$$ is sampled according to $$\pi$$, then this expectation can be estimated by using the sample mean:

$$E_\pi[ s(X) ] \approx \frac{1}{N}\sum_{i=1}^N s(X_i)$$

Now suppose that something prevents us from gathering samples according to the distribution $$\pi$$, but it’s still possible to gather samples from a different distribution $$b$$. i.e. We want $$X \sim \pi$$ but we have $$X \sim b$$ instead.

In this case, the above sample average does not give us the desired expectation. We would be estimating $$E_b[ s(X)]$$ instead of $$E_\pi[ s(X) ]$$.

The importance sampling solution is found by recognizing the following equalities:

$$\begin{eqnarray} && E_\pi \left[ s(X) \right] \\ &=& \sum_x \pi(x)s(x) \\ &=& \sum_x \pi(x)s(x)\frac{b(x)}{b(x)} \\ &=& \sum_x b(x)s(x)\frac{\pi(x)}{b(x)} \\ &=& E_b \left[ s(X)\frac{\pi(X)}{b(X)} \right] \end{eqnarray}$$

This tells us that it’s possible to estimate the expectation under $$\pi$$ even when the samples are drawn according to $$b$$. All we have to do is multiply by the importance sampling ratio, $$\frac{\pi(X)}{b(X)}$$.

The only requirement is that $$b(x)$$ is not 0 when $$\pi(x)$$ is not 0. This is called “coverage”.

## Applying Importance Sampling to Reinforcement Learning

In reinforcement learning, the return $$G_t$$ is generated by acting according to the behavior policy $$b(a | s)$$ with transition dynamics $$p(s’, r | s, a)$$. But we would like to know the expectation of $$G_t$$ under the target policy $$\pi(a | s)$$ with the same transition dynamics.

In this case, the importance sampling ratio is a bit more complicated but can still be derived. $$G_t$$ is a sample from the distribution of $$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b \mspace{5mu} \forall \tau)$$.

Basically this says: “the distribution of all the actions and states that happened after arriving in state $$S_t$$, following the policy $$b$$”.

The distribution we want the expectation with respect to is the same thing, but with actions drawn according to $$\pi$$. i.e. $$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi \mspace{5mu} \forall \tau)$$.

Thanks to the Markov property these distributions are easy to expand.

$$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b) = \prod_{\tau=t}^{T-1} b(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

$$p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi) = \prod_{\tau=t}^{T-1} \pi(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

The importance sampling ratio is then just:

$$\frac{p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi)}{p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b)} = \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

The transition dynamics cancel out because they are the same on both top and bottom.

This is convenient, because we know $$\pi$$ and we know $$b$$, but we do not know $$p$$ (which is why we have to use Monte Carlo in the first place).

Let’s define this importance sampling ratio using the symbol $$\rho$$.

$$\rho_{t:T-1} \dot{=} \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

Using this importance sampling ratio, we can estimate $$Q_\pi(s,a)$$ even though we are acting according to a different policy $$b$$ and using the returns generated from this other policy.

$$Q_\pi(s,a) \approx \frac{1}{N}\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a)$$

Where $$G^{(i)}(s,a)$$ was the sample return when the agent was in state $$s$$ and performed action $$a$$ during the $$i$$’th episode, and $$\rho^{(i)}(s,a)$$ was the corresponding importance sampling ratio.

## Importance Sampling for Monte Carlo Implementation

At this point, you know all the theory. All that you have to do now is plug in the above importance sampling ratio in the appropriate places in your existing Monte Carlo code, and you’ll be doing Monte Carlo with importance sampling.

Here are some important considerations.

Like the return $$G_t$$, the importance sampling ratio is defined in terms of future values. i.e. the importance sampling ratio at time $$t$$ depends on the probabilities of the behavior and target policies at time $$t+1, t+2, …$$.

Therefore, it would make sense to loop backwards in time to compute this ratio, just like we loop backwards in time to compute the return.

Just like the return, the importance sampling ratio can be computed recursively.

Finally, you’ll recall that for the regular unweighted sample mean, it’s possible to perform constant-time updates every time we collect a new sample, instead of naively summing up all the samples and dividing by N.

$$Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) – \frac{1}{i}(G^{(i)}(s,a) – Q^{(i-1)}(s,a))$$

Similarly, it’s possible to express the weighted sample mean update using a similar constant-time operation. i.e.

$$Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) – \alpha^{(i)}(G^{(i)}(s,a) – Q^{(i-1)}(s,a))$$

As an exercise, try to derive what $$\alpha^{(i)}$$ should be.

Last point: I haven’t discussed weighted importance sampling, which can be used to reduce the variance of the estimate. The weighted importance sampling estimate looks like this:

$$Q_\pi(s,a) \approx \frac{\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a)}{ \sum_{i=1}^N \rho^{(i)}(s,a) }$$

## Recap

Let’s review why this is different from regular Monte Carlo.

Regular Monte Carlo (what I covered in my Reinforcement Learning course) is an on-policy control method.

We use epsilon-greedy because exploration is required to collect enough samples for all state-action pairs. Epsilon-greedy is suboptimal by definition. Our Q-table and final policy will thus be suboptimal.

What might be better is an off-policy control method, where we act according to a behavior policy which allows for exploration, but compute the Q-table according to the target greedy policy (the optimal policy).

#reinforcement learning

# LeetCode in Machine Learning and Data Science (Episode 17)

January 15, 2021

Just the other day, I came across a forum post discussing why job interviews for ML engineers and data scientists often involve algorithms and coding. “I don’t want to be a software engineer!” you say.

Except, this is reality. And if you want a job, you have to face this reality.

In this free video, I will show you many of the insightful comments posted in the discussion, and we are going to complete an example LeetCode interview question from beginning to end. I’m also going to show you a typical answer given by someone who doesn’t know algorithms and why that answer is wrong.