The post Reinforcement Learning Algorithms: Expected SARSA appeared first on Lazy Programmer.

]]>In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about SARSA and Q-Learning, two popular TD methods. We’ll see how Expected SARSA unifies the two.

**Before we continue, just a gentle reminder that the VIP discount coupons for Financial Engineering and PyTorch: Deep Learning and Artificial Intelligence are expiring in just one week!**

CLICK HERE to get 75% OFF “Financial Engineering and Artificial Intelligence in python”

Topics covered:

- Exploratory data analysis, significance testing, correlations
- Alpha and beta
- Time series analysis, simple moving average, exponentially-weighted moving average
- Holt-Winters exponential smoothing model
- ARIMA and SARIMA
- Efficient Market Hypothesis
- Random Walk Hypothesis
- Time series forecasting (“stock price prediction”)
- Modern portfolio theory
- Efficient frontier / Markowitz bullet
- Mean-variance optimization
- Maximizing the Sharpe ratio
- Convex optimization with Linear Programming and Quadratic Programming
- Capital Asset Pricing Model (CAPM)
- Algorithmic trading

CLICK HERE to get 75% OFF “PyTorch: Deep Learning and Artificial Intelligence in python”

Topics covered:- Machine learning basics (linear neurons)
- ANNs, CNNs, and RNNs for images and sequence data
- Time series forecasting and stock predictions (+ why all those fake data scientists are doing it wrong)
- NLP (natural language processing)
- Recommender systems
- Transfer learning for computer vision
- GANs (generative adversarial networks)
- Deep reinforcement learning and applying it by building a stock trading bot

Let’s begin by reviewing the regular TD methods covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state \( S_t \) and decides what action \( A_t \) to perform based on the state. This is called the policy and can be characterized by a probability distribution, \( \pi( A_t | S_t) \).

As the agent does this action, it changes the environment which results in the next state \( S_{t+1} \). A reward signal \( R_{t+1} \) is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, \( G_t \). The discounted return is defined as:

$$ G_t \dot{=} R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + … + \gamma^{T – t – 1} R_T $$

One important relationship we can infer from the definition of the return is:

$$ G_t = R_{t + 1} + \gamma G_{t+1} $$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

Taking the expected value of both sides and conditioning on the state \( S_t \), we arrive at the Bellman Equation:

$$ V_\pi(s) = E_\pi[ R_{t + 1} + \gamma V_\pi(S_{t+1}) | S_t = s] $$

If we condition on both the state \( S_t \) and the action \( A_t \), we get the Bellman Equation for the action-value:

$$ Q_\pi(s, a) = E_\pi \left[ R_{t + 1} + \gamma \sum_{a’} \pi(a’ |S_{t+1}) Q_\pi(S_{t+1}, a’) | S_t = s, A_t = a \right] $$

This will be the important relationship to consider when we learn about Expected SARSA.

Let’s go back a few steps.

The expected return given that the agent is in state \( S_t \) and performs action \( A_t \) at time \( t \) is given by the Q-table. Specifically:

$$ Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a] $$

The Q-table can be used to determine what the best action will be since we can just choose whichever action \( a \) maximizes \( Q(s,a) \).

The problem is, we do not know \( Q(s,a) \)! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution \( p(s’, r | s, a) \).

Generally speaking, this is unknown. e.g. Imagine building a self-driving car.

The full Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$ Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a) $$

As you recall, it’s possible to convert the formula for the sample mean into a recursive update – something that looks a bit like gradient descent.

This is convenient so that we can update \(Q\) each time we receive a new sample without having to sum up all \(N\) values each time. If we did that, our computations would get longer and longer as \(N\) grows!

The recursive update looks like this:

$$ Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ G^{(N)}(s,a) – Q_\pi^{(N-1)}(s, a) \right] $$

This shows us how to get the \(N\)th estimate from the \(N-1\)th estimate. By setting \( \alpha = \frac{1}{N} \), we get exactly the sample mean (this was derived in the course).

The Temporal Difference approach uses “bootstrapping”, where we use *existing* values in \(Q_\pi(s, a)\) to estimate the expected return.

The SARSA update looks as follows:

$$ Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ r + \gamma Q_\pi^{(N-1)}(s’, a’) – Q_\pi^{(N-1)}(s, a) \right] $$

Essentially, we have replaced the old “target”:

$$ G(s,a) $$

with a new “target”:

$$ r + \gamma Q(s’, a’) $$

This should remind you of the right-hand side of the Bellman Equation for Q, with all the expected values removed.

What is the significance of this?

With Monte Carlo, we would have to wait until the entire episode is complete in order to compute the return (since the return is the sum of all future rewards).

For very long episodes, this means that it will take a long time before any updates are made to your agent.

For infinitely long episodes, it’s not possible to compute the return at all! (You’d have to wait an infinite amount of time to get an answer.)

TD learning uses the immediate reward \( r \) and “estimates the rest” using the existing value of \( Q \).

Therefore, your agent learns on each step, rather than waiting until the end of an episode.

The Q-Learning “target” is:

$$ r + \gamma \max_a’ Q(s’, a’) $$

The main difference between SARSA and Q-Learning is that SARSA is *on-policy* while Q-Learning is *off-policy*.

This is because Q-Learning always uses the max, even when that action wasn’t taken. SARSA uses action \( a’ \) – whichever action was actually chosen for the next step. Q-Learning learns the value of the policy in which you take the max (even though your behavior policy may be different – e.g. epsilon-greedy). SARSA learns the value of the behavior policy.

For Expected SARSA, the target is:

$$ r + \gamma \sum_{a’} \pi( a’ | s’) Q(s’, a’) $$

This can also be written as:

$$ r + \gamma E_\pi [Q(s’, a’) | s’] $$

Thus, Expected SARSA uses the *expected* action-value for the next state \( s’ \), over all actions \( a’ \), weighted according to the policy distribution.

Like regular SARSA, this should remind you of the Bellman Equation for Q (even more so than regular SARSA since it now properly sums over the policy distribution).

You can think of regular SARSA as merely drawing samples from the Expected SARSA target distribution. Put a different way, SARSA does what Expected SARSA does, in expectation.

So how does Expected SARSA unify SARSA and Q-Learning?

In fact, Q-Learning is merely a special case of Expected SARSA.

First, recognize that Expected SARSA can be on-policy or off-policy. In the off-policy case, one could act according to some behavior policy \( b(a | s) \), while the target is computed according to the target policy \( \pi( a | s) \).

If \( b(a | s) \) is epsilon-greedy while \( \pi(a | s) \) is greedy, then Expected SARSA is equivalent to Q-Learning.

On the other hand, Expected SARSA generalizes SARSA as well, as discussed above.

1) Aside from all the theory above, the update itself is actually quite simple to implement. In some cases, you may find that it works better than SARSA or Q-Learning. Thus, it is a worthy addition to your RL toolbox and could form a part of your strategy for building agents to solve RL problems.

2) It’s slower than SARSA and Q-Learning since it requires summing over the action space on each update. So, you may have to decide whether the additional computation requirements are worth whatever increase in performance you observe.

The post Reinforcement Learning Algorithms: Expected SARSA appeared first on Lazy Programmer.

]]>The post Monte Carlo with Importance Sampling for Reinforcement Learning appeared first on Lazy Programmer.

]]>In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about the Monte Carlo method. But that’s just the beginning. There is still more that can be done to improve the agent’s learning capabilities.

**Before we continue, just a gentle reminder that the VIP discount coupons for Financial Engineering and PyTorch: Deep Learning and Artificial Intelligence are expiring in just one week!**

CLICK HERE to get 75% OFF “Financial Engineering and Artificial Intelligence in python”

Topics covered:

- Exploratory data analysis, significance testing, correlations
- Alpha and beta
- Time series analysis, simple moving average, exponentially-weighted moving average
- Holt-Winters exponential smoothing model
- ARIMA and SARIMA
- Efficient Market Hypothesis
- Random Walk Hypothesis
- Time series forecasting (“stock price prediction”)
- Modern portfolio theory
- Efficient frontier / Markowitz bullet
- Mean-variance optimization
- Maximizing the Sharpe ratio
- Convex optimization with Linear Programming and Quadratic Programming
- Capital Asset Pricing Model (CAPM)
- Algorithmic trading

CLICK HERE to get 75% OFF “PyTorch: Deep Learning and Artificial Intelligence in python”

Topics covered:- Machine learning basics (linear neurons)
- ANNs, CNNs, and RNNs for images and sequence data
- Time series forecasting and stock predictions (+ why all those fake data scientists are doing it wrong)
- NLP (natural language processing)
- Recommender systems
- Transfer learning for computer vision
- GANs (generative adversarial networks)
- Deep reinforcement learning and applying it by building a stock trading bot

Let’s begin by reviewing the regular Monte Carlo method covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state \( S_t \) and decides what action \( A_t \) to perform based on the state. This is called the policy and can be characterized by a probability distribution, \( \pi( A_t | S_t) \).

As the agent does this action, it changes the environment which results in the next state \( S_{t+1} \). A reward signal \( R_{t+1} \) is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, \( G_t \). The discounted return is defined as:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + … + \gamma^{T – t – 1} R_T $$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

The expected return given that the agent is in state \( S_t \) and performs action \( A_t \) at time \( t \) is given by the Q-table. Specifically:

$$ Q_\pi(s, a) = E_\pi[ G_t | S_t = s, A_t = a] $$

The Q-table can be used to determine what the best action will be since we can just choose whichever action \( a \) maximizes \( Q(s,a) \).

The problem is, we do not know \( Q(s,a) \)! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution \( p(s’, r | s, a) \).

Generally speaking, this is unknown. e.g. Imagine building a self-driving car.

The Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$ Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a) $$

Where \( G^{(i)}(s,a) \) was the sample return when the agent was in state \( s \) and performed action \( a \) during the \( i \)’th episode.

Put simply: play a bunch of episodes, collect all the state-action-reward sequences, calculate the returns (by summing up the rewards), and then compute the average return for each state-action pair.

Easy!

How can we ensure that we visit *every* state-action pair so that the whole Q-table is filled up with a sufficient number of samples?

Practically, we usually employ some kind of exploration strategy, such as epsilon-greedy. With epsilon-greedy, we perform the optimal action \( 1-\varepsilon \) of the time, and we pick a random action \( \varepsilon \) of the time. So \( \varepsilon \) is the probability of exploration.

The problem with this approach is that it leads to a *suboptimal* policy. Why? It means that approximately \( \varepsilon \) of the time, you are going to do something suboptimal!

(Note: it’s not exactly \( \varepsilon \) since choosing an action randomly can still lead to choosing the optimal action by chance.)

This is where we transition to the main topic of this article, which is how Importance Sampling can help us overcome this problem.

Monte Carlo with epsilon-greedy exploration is called an on-policy control method, because the action-value (Q-table) being estimated corresponds to the policy that the agent is following.

On the other hand, off-policy methods allow the agent to act according to one policy (called the *behavior* policy), while the action-value is computed for a different, eventually optimal policy (called the *target* policy).

Henceforth we will denote the target policy as \( \pi( a | s) \) and the behavior policy as \( b(a | s) \).

Suppose that we would like to estimate the expected value of some function \( s \) of a random variable \( X \) under some distribution \( \pi \). We write this as:

$$ E_\pi[ s(X) ] $$

If we know \( \pi \), then (assuming \( X \) is a discrete random variable) this can be computed as:

$$ E_\pi[ s(X) ] = \sum_x \pi(x)s(x) $$

If we don’t know \( \pi \), but \( X \) is sampled according to \( \pi \), then this expectation can be estimated by using the sample mean:

$$ E_\pi[ s(X) ] \approx \frac{1}{N}\sum_{i=1}^N s(X_i) $$

Now suppose that something prevents us from gathering samples according to the distribution \( \pi \), but it’s still possible to gather samples from a different distribution \( b \). i.e. We want \( X \sim \pi \) but we have \( X \sim b \) instead.

In this case, the above sample average does *not* give us the desired expectation. We would be estimating \( E_b[ s(X)] \) instead of \( E_\pi[ s(X) ] \).

The importance sampling solution is found by recognizing the following equalities:

$$\begin{eqnarray}

&& E_\pi \left[ s(X) \right] \\

&=& \sum_x \pi(x)s(x) \\

&=& \sum_x \pi(x)s(x)\frac{b(x)}{b(x)} \\

&=& \sum_x b(x)s(x)\frac{\pi(x)}{b(x)} \\

&=& E_b \left[ s(X)\frac{\pi(X)}{b(X)} \right]

\end{eqnarray}$$

This tells us that it’s possible to estimate the expectation under \( \pi \) even when the samples are drawn according to \( b \). All we have to do is multiply by the importance sampling ratio, \( \frac{\pi(X)}{b(X)} \).

The only requirement is that \( b(x) \) is not 0 when \( \pi(x) \) is not 0. This is called “coverage”.

In reinforcement learning, the return \( G_t \) is generated by acting according to the behavior policy \( b(a | s) \) with transition dynamics \( p(s’, r | s, a) \). But we would like to know the expectation of \( G_t \) under the target policy \( \pi(a | s) \) with the same transition dynamics.

In this case, the importance sampling ratio is a bit more complicated but can still be derived. \( G_t \) is a sample from the distribution of \( p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b \mspace{5mu} \forall \tau) \).

Basically this says: “the distribution of all the actions and states that happened after arriving in state \( S_t \), following the policy \( b \)”.

The distribution we want the expectation with respect to is the same thing, but with actions drawn according to \( \pi \). i.e. \( p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi \mspace{5mu} \forall \tau) \).

Thanks to the Markov property these distributions are easy to expand.

$$ p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b) = \prod_{\tau=t}^{T-1} b(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

$$ p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi) = \prod_{\tau=t}^{T-1} \pi(A_\tau | S_\tau)p(S_{\tau+1} | S_\tau, A_\tau)$$

The importance sampling ratio is then just:

$$ \frac{p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim \pi)}{p(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_\tau \sim b)} = \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

The transition dynamics cancel out because they are the same on both top and bottom.

This is convenient, because we know \( \pi \) and we know \( b \), but we do not know \( p \) (which is why we have to use Monte Carlo in the first place).

Let’s define this importance sampling ratio using the symbol \( \rho \).

$$ \rho_{t:T-1} \dot{=} \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)}$$

Using this importance sampling ratio, we can estimate \( Q_\pi(s,a) \) even though we are acting according to a different policy \( b \) and using the returns generated from this other policy.

$$ Q_\pi(s,a) \approx \frac{1}{N}\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a) $$

Where \( G^{(i)}(s,a) \) was the sample return when the agent was in state \( s \) and performed action \( a \) during the \( i \)’th episode, and \( \rho^{(i)}(s,a) \) was the corresponding importance sampling ratio.

At this point, you know all the theory. All that you have to do now is plug in the above importance sampling ratio in the appropriate places in your existing Monte Carlo code, and you’ll be doing Monte Carlo with importance sampling.

Here are some important considerations.

Like the return \( G_t \), the importance sampling ratio is defined in terms of *future* values. i.e. the importance sampling ratio at time \( t \) depends on the probabilities of the behavior and target policies at time \( t+1, t+2, … \).

Therefore, it would make sense to loop backwards in time to compute this ratio, just like we loop backwards in time to compute the return.

Just like the return, the importance sampling ratio can be computed *recursively*.

Finally, you’ll recall that for the regular unweighted sample mean, it’s possible to perform constant-time updates every time we collect a new sample, instead of naively summing up all the samples and dividing by N.

$$ Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) – \frac{1}{i}(G^{(i)}(s,a) – Q^{(i-1)}(s,a)) $$

Similarly, it’s possible to express the *weighted* sample mean update using a similar constant-time operation. i.e.

$$ Q^{(i)}(s,a) \leftarrow Q^{(i-1)}(s,a) – \alpha^{(i)}(G^{(i)}(s,a) – Q^{(i-1)}(s,a)) $$

As an exercise, try to derive what \( \alpha^{(i)} \) should be.

Last point: I haven’t discussed weighted importance sampling, which can be used to reduce the variance of the estimate. The weighted importance sampling estimate looks like this:

$$ Q_\pi(s,a) \approx \frac{\sum_{i=1}^N \rho^{(i)}(s,a) G^{(i)}(s,a)}{ \sum_{i=1}^N \rho^{(i)}(s,a) } $$

Let’s review why this is different from regular Monte Carlo.

Regular Monte Carlo (what I covered in my Reinforcement Learning course) is an *on-policy *control method.

We use epsilon-greedy because exploration is *required* to collect enough samples for all state-action pairs. Epsilon-greedy is suboptimal by definition. Our Q-table and final policy will thus be suboptimal.

What might be better is an *off-policy* control method, where we act according to a behavior policy which allows for exploration, but compute the Q-table according to the target greedy policy (the optimal policy).

The post Monte Carlo with Importance Sampling for Reinforcement Learning appeared first on Lazy Programmer.

]]>The post Why bad programmers always need the latest version appeared first on Lazy Programmer.

]]>Today, I’ve got 2 exciting things for you.

First, now is your chance to VOTE to tell me what you want to see in my next course. Transformers? Time Series Analysis? More advanced GANs? More advanced Reinforcement Learning? Let me know in this survey (it’s anonymous):

https://forms.gle/CNjjdaffSpi67qvq9

Second, check out the latest episode of The Lazy Programmer Show, where I discuss why bad programmers are always trying to get the latest version of some language or library, and why they tend to freak out when things are not in whatever version they happen to be using.

I look at this topic from 3 different perspectives, including:

1) What is it like in the “real world” working at a “real job”?

2) What kind of skills should a typical (competent) programmer have?

3) What are students learning and how do they approach machine learning and coding in colleges? Remember those students become “new grads” and those “new grads” become “junior engineers”.

What would your boss say if a junior engineer could run circles around you, a so-called “professional”?

Click here to watch the video https://www.youtube.com/watch?v=BIXH_m6CT2I or click the image below:

The post Why bad programmers always need the latest version appeared first on Lazy Programmer.

]]>The post Tensorflow 2 One Year Later: What do I think now? (+PyTorch, JAX, Julia) appeared first on Lazy Programmer.

]]>Is it still good?

Is it still worth learning?

Why does stuff keep breaking / changing?

How does it compare to other offerings? (PyTorch, JAX, Julia)

Did you know JAX was created by Google? (thus, using Tensorflow doesn’t equate to “doing what Google’s doing”)

Julia is a totally different programming language popular with many data scientists and machine learning engineers. Will it replace Python?

PLUS, a little bonus (but you’ll have to watch the video to see what it is)

Check out the video here: https://youtu.be/

Or just watch it here:

The post Tensorflow 2 One Year Later: What do I think now? (+PyTorch, JAX, Julia) appeared first on Lazy Programmer.

]]>The post NEW COURSE: Financial Engineering and Artificial Intelligence in Python appeared first on Lazy Programmer.

]]>Hello once again friends!

Today, I am announcing the VIP version of my latest course: Financial Engineering and Artificial Intelligence in Python.

If you don’t want to read my little spiel just click here to get your VIP coupon:

~~https://www.udemy.com/course/ai-finance/?couponCode=FINANCEVIP (expires Oct 9, 2020)~~

https://www.udemy.com/course/ai-finance/?couponCode=FINANCEVIP7 (expires Apr 15, 2021)

*(as usual, this coupon lasts only 30 days, so don’t wait!)*

This is a MASSIVE (19 hours) Financial Engineering course covering the core fundamentals of financial engineering and financial analysis from scratch. We will go in-depth into all the classic topics, such as:

- Exploratory data analysis, significance testing, correlations
- Alpha and beta
- Advanced Pandas Data Frame manipulation for time series and finance
**Time series analysis**, simple moving average, exponentially-weighted moving average- Holt-Winters exponential smoothing model
- ARIMA and SARIMA
- Efficient Market Hypothesis
- Random Walk Hypothesis
**Time series forecasting**(“stock price prediction”)**Modern portfolio theory**- Efficient frontier / Markowitz bullet
- Mean-variance optimization
- Maximizing the Sharpe ratio
- Convex optimization with Linear Programming and Quadratic Programming
- Capital Asset Pricing Model (CAPM)
**Algorithmic trading**

In addition, we will look at various non-traditional techniques which stem purely from the field of machine learning and artificial intelligence, such as:

- Regression models
- Classification models
- Unsupervised learning
- Reinforcement learning and Q-learning

We will learn about the greatest flub made in the past decade by marketers posing as “machine learning experts” who promise to teach unsuspecting students how to “predict stock prices with LSTMs”. You will learn exactly why their methodology is fundamentally flawed and why their results are complete nonsense. It is a lesson in how not to apply AI in finance.

As with my Tensorflow 2 release, some of the VIP content will be a surprise and will be released in stages. Currently, the entirety of the Algorithmic Trading sections are VIP sections. Newly added VIP sections include Statistical Factor Models and “The Lazy Programmer Bonus Offer”. Here’s a full list:

**Classic Algorithmic Trading – Trend Following Strategy**

You will learn how moving averages can be applied to do algorithmic trading.

**Machine Learning-Based Trading Strategy**

Forecast returns in order to determine when to buy and sell.

**Reinforcement Learning-Based (Q-Learning) Trading Strategy**

I give you a full introduction to Reinforcement Learning from scratch, and then we apply it to build a Q-Learning trader. Note that this is *not* the same as the example I used in my Tensorflow 2, PyTorch, and Reinforcement Learning courses. I think the example included in this course is much more principled and robust.

**Statistical Factor Models**

The CAPM is one of the most renowned financial models in history, but did you know it’s only the *simplest* factor model, with just a single factor? To go beyond just this single factor model, we will learn about statistical factor models, where the multiple “factors” are found automatically using only the data.

**The Lazy Programmer Bonus Offer**

There are marketers out there who want to capitalize on your enthusiastic interest in finance, and unfortunately what they are teaching you is *utter and complete garbage*.

They will claim that they can “predict stock prices with LSTMs” and show you charts like this with nearly perfect stock price predictions.

Hint: if they can do this, why do they bother putting effort into making courses? Wouldn’t they already be billionaires?

**Have you ever wondered if you are taking such a course from a fake data scientist / marketer? **If so, just send me a message, and I will tell you whether or not you are taking such a course. (Hint: many of you are) I will give you a list of mistakes they made so you can look out for them yourself, and avoid “learning” things which will ultimately make YOU look very bad in front of potential future employers.

Believe me, if you ever try to get a job in machine learning or data science and you talk about a project where you “predicted stock prices with LSTMs”, all you will be demonstrating is *how incompetent you are*.

Save yourself from this embarrassing scenario by taking the “Lazy Programmer Offer”!

**Please note: **The VIP coupon will work only for the next month (starting from the coupon creation time). It’s unknown whether the VIP period will renew after that time.

After that, although the VIP content will be removed from Udemy, all who purchased the VIP course will get permanent free access to these VIP contents on deeplearningcourses.com.

In case it’s not clear, the process is *very easy*. For those folks who need the “step-by-step” instructions…:

STEP 1) I announce the VIP content will be removed.

STEP 2) You email me with proof that you purchased the course during the VIP period. Do NOT email me earlier as it will just get buried.

STEP 3) I will give you free access to the VIP materials for this course on deeplearningcourses.com.

- Learn the knowledge you need to work at top tier investment firms
- Gain practical, real-world quantitative skills that can be applied within and outside of finance
- Make better decisions regarding your own finances

Personally, I think this is the most *interesting* and *action-packed* course I have created yet. My last few courses were cool, but they were all about topics which I had already covered in the past! GANs, NLP, Transfer Learning, Recommender Systems, etc etc. all just machine learning topics I have covered several times in different libraries. This course contains new, fresh content and concepts I have never covered in any of my courses, ever.

This is the first course I’ve created that extends into a niche area of AI application. It goes *outside of AI *and into domain expertise. An in-depth topic such as finance deserves its own course. This is that course. These are topics you will never learn in a generic data science or machine learning course. However, as a student of AI, you will recognize many of our tools and methods being applied, such as statistical inference, supervised and unsupervised learning, convex optimization, and optimal control. This allows us to go *deeper *than your run of the mill financial engineering course, and it becomes more than just the sum of its parts.

So what are you waiting for?

The post NEW COURSE: Financial Engineering and Artificial Intelligence in Python appeared first on Lazy Programmer.

]]>The post How to Build Your Own Computer Science Degree appeared first on Lazy Programmer.

]]>

Summary of the video:

The following books can be used to study core computer science topics at the college / university level, to prepare yourself for machine learning, deep learning, artificial intelligence, and data science.

These are the books I recommend for building your own computer science degree. Remember! The goal is to do as many exercises as you can. It’s not to just watch 5 minute YouTube videos and then conclude “I understand everything! There’s no need for exercises!”

This quote from the video sums it up nicely: if you don’t find the problems, the problems will find you.

These books cover common core courses that are relevant for many sub-fields of Computer Science and Engineering, including Machine Learning et. al., but also related fields such as operations research, statistics, quantitative finance, software engineering, digital communications, wireless communications, control systems (e.g. autopilot), robotics, and many more.

To recap, these are the courses and why you want to take them:

Calculus

Nearly all machine learning algorithms boil down to optimization problems. What is optimization? Generally speaking, it’s when you have a function and you want to maximize or minimize that function.

If you’ve taken calculus, then you should recall that this is exactly what you learn how to do in calculus.

Therefore, calculus is an essential tool in artificial intelligence, data science, etc.

Linear Algebra

In machine learning and deep learning especially, we work with vectors, matrices, and higher-dimensional objects. This is the realm of linear algebra.

Luckily, you don’t have to go *that far* in linear algebra to get what you need for deep learning and AI.

For example, the concept of spans, subspaces, rank, etc. rarely show up in machine learning.

On the other hand, the basics, such as matrix and vector multiplication and eigenvalues and eigenvectors, show up often.

Probability

Probability is *the* language you must speak if you want to do machine learning and AI.

Recall above that machine learning often boils down to an optimization. What are we trying to optimize? Often, it’s an *expected value*. What is an expected value? Well, you have to learn probability to find that out.

For a time, “probabilistic graphical models” were the state of the art in artificial intelligence. Clearly, probability would be a prerequisite.

Probability shows up nearly everywhere in machine learning, from soft k-means clustering to hidden Markov models to artificial neural networks and beyond.

Side note: if you were thinking earlier, “who needs calculus when Tensorflow can do it for me!?”, think again. Calculus is a prerequisite to probability. So if you want to learn probability, you still need calculus anyway.

Programming

Obviously, at some point, you need to be able to actually write a computer program in order to use machine learning.

Nowadays, things can seem very simple when all you need is 3 lines of boilerplate code to use scikit-learn.

However, that’s not really what one should imagine when they think of “learning machine learning”.

Check any college-level machine learning course (not that bullshit being sold by marketers online) to confirm what I am saying.

As the great physicist Richard Feynman once said, “What I cannot create, I do not understand”.

In order to best understand a machine learning algorithm, you should implement it.

No, you are not “reinventing the wheel”. This is called “learning”.

I would posit that if you can’t implement basic algorithms like k-means clustering, logistic regression, k-nearest neighbor, and naive Bayes, you do *not* understand those algorithms.

So why do I suggest Java over something like Python, which has easily become the most popular language for doing data science.

The problem with Python is that it’s too high level. It doesn’t make you *think* about the program you are writing at the algorithmic level.

You should understand the difference between an efficient and an inefficient algorithm. (No, that doesn’t mean memorizing facts like “Python list comprehensions are better than for loops”).

In fact you should recognize that list comprehensions have the exact same time complexity as your for loop.

Java, being slightly lower level, forces you to think algorithmically.

That brings us to the final topic.

Algorithms and Data Structures

In order to really understand algorithms, you should study… algorithms.

There are many famous algorithms contained in the book I’ve suggested below.

Realistically, you are not going to use these in your day to day work (a very common complaint from software developers seeking employment).

However, that’s not really the point.

The point is exercising your brain and learning how to think in certain ways that help you write better code.

Also, you should understand the pros and cons of basic data structures such as lists, sets, dictionaries, and trees.

If you’re coding up some algorithm, why might a set be better than a list? Algorithms tell you why.

One major data structure you might want to learn about is *graphs* (along with their associated algorithms). Graph neural networks seem to be picking up steam, and they are being used for all kinds of interesting problems like social network analysis, chemistry, and more.

To summarize: the core courses I would consider essential for building your own Computer Science degree in preparation for machine learning, deep learning, data science, and artificial intelligence are calculus, linear algebra, probability, programming, and algorithms.

Don’t just watch a bunch of videos on YouTube or Khan Academy and then proclaim you understand the subject. The reason I’ve suggested books is because they contain exercises / homework problems. These are what you must be able to do in order to claim that you understand something. It’s not about “absorbing information”, it’s about “producing useful output”.

Common question: What about C++? Yes, C++ is excellent! Ideally, you will learn both C++ and Java, but obviously, these are not hard prerequisites for machine learning or data science.

Disclaimer: this post contains Amazon affiliate links.

The post How to Build Your Own Computer Science Degree appeared first on Lazy Programmer.

]]>The post Why does “how to succeed in this course” exist? appeared first on Lazy Programmer.

]]>Note that in some courses, this may have been replaced with the alternate lecture, “anyone can succeed in this course”.

Every now and then, this lecture *really offends someone* (probably because it hits one or two of their “sensitive spots” and they know that they are making some of the mistakes I mention in that lecture). But I’d rather not speculate, and furthermore, there is no reason to.

Let’s get down to business:

Your interpretation of this lecture is wrong.

Your main mistake is that you haven’t thought of your fellow students (and there are thousands of them).

Obviously, this lecture wouldn’t have been made if it were not deemed necessary.

Common sense (hopefully) should tell you that these “strange statements” (yes, someone really called them that, haha!) weren’t invented out of thin air.

Clearly, based on this lecture, you can conclude that:

– there are students who don’t follow instructions and need a reminder

– there are students who don’t meet the prerequisites and need a reminder

– there are students who believe all the prerequisites can be included in this course (not true)

– there are students who believe they can understand this course without meeting the prerequisites (not true)

– there are students who make up self-defeating excuses not to use the Q&A

– there are students who don’t know that coding is an exercise

– there are students who don’t know about the speed changer

– there are students who forget things and don’t realize that taking notes would have prevented them from forgetting

– there are students who believe their PhD absolves them from following the instructions or meeting the prerequisites

– etc.

Need I say more?

Finally, understand that the course is not personalized for you. It’s for everybody.

Therefore, if you find a lecture that doesn’t contain content you want to learn, simply skip it.

It should be very easy. Let me know if you can’t find the “skip” button.

In fact, this is stated in this very lecture, so maybe you should have watched it more carefully. 😉

The lesson is: be kind, courteous, understanding, and accommodating to your fellow students. Hopefully, this makes sense.

Common assertions:

it is patronizing

It’s only patronizing if you believe you are “above” this advice, in which case, you should double check that the problem isn’t that you think too highly of yourself. 😉

it is negative

Is today backwards day?

“Anyone can succeed in this course” is a negative statement?

The opposite of that is “Nobody can succeed in this course”. I suppose that would be positive?

Overall:

Nobody to this date has ever been able to provide a direct rebuttal to any of the points I have mentioned above.

If you think you can rebut anything I have said, I am actually so eager to hear what you have to say. I have been waiting **years** for someone to tell me why any of the above points are wrong!

The post Why does “how to succeed in this course” exist? appeared first on Lazy Programmer.

]]>The post New Exclusive Course: Linear Programming for Linear Regression in Python appeared first on Lazy Programmer.

]]>If you’ve been to deeplearningcourses.com recently, you will have noticed that there is now a section for **exclusive courses**. These are courses that will *not* be on any other platforms, only deeplearningcourses.com.

These are what I’ve been calling “mini-courses” during their development and that’s what they are in spirit. They are:

- Lower cost
- Shorter in duration

There won’t be any time spent on stuff like appendices which most of you have already seen and are mainly for beginners.

The point of these courses is to have a faster turn-around time on course development. Sometimes, there are topics I want to cover really quickly that won’t ever become a full-sized course. They will also be used to cover more **advanced** topics.

Unfortunately, a lot of students on other platforms (e.g. Udemy) are complete beginners who have no desire advance and gain actual skill. They take “marketer-taught” courses which leads to a complex which I call “confidence without ability”. Dealing with such students is draining.

These mini-courses will bring us back to the **old days** (many of you have been around since then!) where the material was more concise, straight-to-the-point, and didn’t need “beginner reminders” all over the place.

Given that these mini-courses are much simpler for me to make, I expect there to be many more in the future.

This first exclusive mini-course is on **Linear Programming for Linear Regression**.

Many students in my Linear Regression course often ask, “What if I want to use absolute error instead of squared error?” This course answers exactly that question and more.

The solution is based on Linear Programming (LP).

We will also cover 2 other common problems: maximum absolute deviation and positive-only (or negative-only) error.

These kinds of problems are often found in professional fields such as quantitative finance, operations research, and engineering.

Each of these problems can be solved using Linear Programming with the Scipy library.

BONUS FACT: I have a new pen and tablet set up so most of the derivations in this course are done by hand – really truly old-school like the Linear/Logistic Regression days!

Get the course here: https://

Another exclusive course which has already been on deeplearningcourses.com for some time is my original MATLAB course. This was the first course I ever made and is basically a collector’s item. The quality isn’t that great compared to what I am creating now, but obviously you will still learn a lot.

I’m including it in this newsletter to announce that I was able to dig up an extra section on probability that didn’t exist before. So the course now has 3 major sections:

- MATLAB basic operations and variables
- Signal processing with sound and images
- Probability and statistics

Get the course here: https://

The post New Exclusive Course: Linear Programming for Linear Regression in Python appeared first on Lazy Programmer.

]]>The post Beginners: How to get an infinite amount of practice and exercise in machine learning appeared first on Lazy Programmer.

]]>There are several ways to answer this.

First, let’s make an important distinction.

There’s a difference between putting in the work to understand an algorithm, and using that algorithm on data. We’ll call these the “learning phase” and the “application phase”.

Learning phase = Putting in the work to understand an algorithm

Application phase = Using that algorithm on data

Let’s take a simple example: linear regression.

In the **learning phase**, your tasks will include:

- Being able to derive the algorithm from first principles (that’s calculus, linear algebra, and probability)
- Implementing the algorithm in the language of your choice (it need not be Python)
- Testing your algorithm on data to verify that it works

These are essential tasks in ensuring that you really understand an algorithm.

Doing these tasks are “exercises” which improve your general aptitude in machine learning, and will strengthen your ability to learn other algorithms in the future, such as logistic regression, neural networks, etc.

As my famous motto goes: “**if you can’t implement it, then you don’t understand it**”.

Interestingly, 5 years after I invented this motto, I discovered that the famous physicist Richard Feynman said a very similar thing!

In order to get an infinite amount of practice in this area, you should learn about various extensions on this algorithm, such as L1 and L2 regularization, using gradient descent instead of the closed-form solution, 2nd order methods, etc.

You might want to try implementing it in a different language. And finally, you can spend a lifetime exercising your ability to understand machine learning algorithms by learning about more machine learning algorithms in much the same way.

Believe me, 10 years down the line you may discover something new and interesting about even the simplest models like Linear Regression.

The second phase is the “**application phase**”.

Here is where your ability to exercise and practice is really infinite.

Let’s first remember that I don’t know you personally. I don’t know what you care about, what field you are in, or what your motivations for learning this subject are.

Therefore, I cannot tell you where to apply what you’ve learned: only you know that.

For example, if you are a computational biologist, then you can use this algorithm on problems specific to computational biology.

If you are a financial engineer, then you can use this algorithm on problems specific to financial engineering.

Of course, because I am not a computational biologist, I don’t know what that data looks like, what the relevant features are, etc.

I can’t help you with that.

The “interface” where I end and you begin is the algorithm.

After I teach you how and why the algorithm works and how to implement it in code, using it to further scientific knowledge in your own field of study becomes your responsibility.

One can’t expect me to be an expert computational biologist and an expert financial engineer and whatever else it is that you are an expert in.

Therefore, you can’t rely on me to tell you what datasets you might be interested in, what kinds of problems you’re trying to solve, etc.

Presumably, since you’re the expert, you should know that yourself!

If you don’t, then you are probably not the expert you think you are.

But therein lies the key.

Once you’ve decided what you care about, you can start applying what you’ve learned to those datasets.

This will give you an infinite amount of practice, assuming you don’t run out of things to care about.

If you don’t care about anything, well then, why are you doing this in the first place? Lol.

This also ties nicely into another motto of mine: “**all data is the same**”.

What does this mean?

Let’s recall the basic “pattern” we see when we use scikit-learn:

model = LinearRegression() model.fit(X_train, Y_train) model.predict(X_test)

Does this code change whether (X, Y) represent a biology dataset or a finance dataset?

The answer is no!

Otherwise, no such library as Scikit-Learn could even exist!

“All data is the same” means that the same Linear Regression algorithm applies, no matter what field or what industry your data happens to come from.

There’s no such thing as “Linear Regression for biology” and “Linear Regression for finance”.

There’s only one linear regression that is the same linear regression no matter the dataset.

Thus, you learn the algorithm once, and you can apply it infinitely to any number of datasets!

Pretty cool huh?

But look, if you really have zero idea of what you care about, or your answer is “I care about machine learning”, then there are plenty of stock datasets that you can look up on your own.

These include Kaggle, the UCI repository, etc. There’s so much data out there, you will still have to pick and choose what to focus on first.

Again, you have to choose what you care about. Nobody else can possibly tell you that with any accuracy.

**Addendum**:

The “learning phase” above does not apply to situations where you’re learning an API (for example, Tensorflow 2, PyTorch, or even Scikit-Learn).

Why?

Well firstly, there’s nothing really to derive.

Secondly, it would be impossible for you to implement anything yourself without me showing you how first (at which point anything you type would amount to simply copying what I did).

Why?

Well, how would you know what to type if I didn’t show you?

Are you going to magically come up with the correct syntax for a library that you simply haven’t learned?

Obviously not. That would be a ludicrous idea.

In this case, the “learning phase” amounts to:

- Understanding the syntax I’ve shown you
- Being able to replicate that syntax on your own

This most closely represents what you will do in the “real-world”. In the real-world, you want to be able to write code fast and efficiently, rather than trying to remember which course and which lecture covered that exact syntax you’re thinking of.

Being able to write code on-the-fly makes you efficient and fast. Obviously, I don’t remember everything, but I know where to look when I need something. You have to find the right balance between memorizing and looking up. Just like how computer programs use caches and RAM for fast data retrieval instead of the hard drive.

Obviously, this will get better over time as you practice more and more.

It sounds overly simplistic, but it’s nothing more than repetition and muscle memory. I usually don’t explicitly commit to memorizing anything. I just write code and let it come naturally. The key is: I write code.

Obviously, watching me play tennis or reading books about tennis will not make you a better tennis player.

You must get on the court, pick up the tennis racket, and play actual tennis matches!

I can’t force you to do this. It’s the kind of thing which must be done of your own volition.

I mean, if you want to pay me consulting hours to call you and remind you to practice, I’d be very happy to oblige. =)

At this point, once you have completed the “learning phase” of learning an API, then the “application phase” described above still applies.

The post Beginners: How to get an infinite amount of practice and exercise in machine learning appeared first on Lazy Programmer.

]]>The post What Does “In-Depth” Mean? appeared first on Lazy Programmer.

]]>To do this, I will provide several examples.

Example 1) Typical Udemy Course

Example 2) Typical College (“Academic”) Course

Example 3) My Courses

A typical Udemy course on machine learning spends ~20 minutes talking about how “cool” and “powerful” an algorithm is (useless).

Then, the instructor spends ~20 minutes talking about “intuition” (maybe a few pictures here and there). Importantly, **no math is used, no pseudocode is presented.**

Then, the instructor spends ~30-45 minutes explaining 3 lines of code:

model = MyModel() model.fit(X, Y) model.predict(X)

**Why is this bad?**

- You still have no idea how the model really works. Your “intuition” is often wrong.
- Sometimes, even the instructor himself is wrong! (That’s because most instructors teaching these courses are marketers, not practicing data scientists.) They cannot teach the math or algorithms parts, because they don’t understand it themselves.
- Look at the code above. Is it even related to any algorithm? This code has no relation to understanding how the model works. As such, it is useless for learning “machine learning”.
- You can’t implement the model. As I always say, “If you can’t implement it, then you don’t understand it”. Or, as the famous physicist Richard Feynman once said, “What I cannot create, I do not understand”.
- You just paid someone to learn just 3 lines of code and some flakey “intuition”. You could have done that for FREE! Check the Scikit-Learn documentation.

If I only had a choice between #1 and #2, I would usually choose #2. Despite the problems listed below, a college course on ML is far superior to the typical “for dummies” Udemy course.

The typical college course proceeds as follows:

Show all the math and pseudocode (usually), but go through them at **super speed!** Usually, each slide contains like 3-5 equations (way too much packed onto one slide).

Good Example:

Yikes!

Luckily, college ML courses **do not use Scikit-Learn** (unlike typical Udemy courses). Students do get to practice implementing algorithms (good).

Unfortunately, because the class moves so quickly, they will not get to implement everything. Usually, each lecture will cover 1 or 2 algorithms (suppose there’s 12 lectures per semester, so you will learn ~12-20 different algorithms in the whole course – too many too fast in my opinion).

But, since you can’t have 20 homework assignments (don’t get any crazy ideas professors!), you will likely only get to implement a few (e.g. logistic regression, k-means clustering, neural networks).

The **end result**: the course goes so fast, that students end up learning very little. They have *some idea* how each algorithm works, but not **in-depth** (even though all the equations were shown, it was simply too fast).

**Another problem: **Because implementation is only a homework assignment, you never get to see a reference implementation. As you probably know, not everyone is an A+ student. Why?

Well, you lose marks when you get things wrong. Students coding these assignments, if they get a B or a C, it means they got some things wrong. However, they passed the course. But they never learned to code the algorithm correctly.

So even if you “pass the course”, you are still missing knowledge!

In my courses, I aim to fix these shortcomings of both the typical Udemy course and the typical academic course.

How? By going **in-depth** into each subject.

Specifically:

- Don’t cover every machine learning algorithm under the sun. Focus on one or a few algorithms per course.
- For each algorithm, derive any equations used
**from first principles**(typically, that means calculus, linear algebra, and probability, unless the course had other prerequisites) - For each algorithm, derive the “algorithmic” part / pseudocode from first principles
- For each algorithm, implement it in code from scratch. This requires
*true understanding*. It forces you to understand the equations and not skip over the details. It forces you to think about how the algorithm works, rather than just using flawed “intuition”.

**Advantage #1:** Anyone should be able to fully understand the material as long as they meet the prerequisites (usually basic calculus, linear algebra, probability, and programming). This is because everything is built up logically from these basic first principles.

**Advantage #2: **Because everything is **derived** from first principles, it’s unlike a typical academic course, where too many equations are shown all at once, and it’s not clear how the flow of logic proceeds unless you go through them slowly by yourself. Instead, each equation is shown one-by-one.

If you don’t understand it, then simply pause the video, and review it again, or use the Q&A to clarify your misunderstanding.

In a college class, things go by so fast you rarely have the opportunity to even think of the right question to ask before the lecturer moves on to the next subject.

**Advantage #3: **You get to implement everything from scratch. This means you have true understanding. Unlike the typical academic course, where implementation is just homework, I am going to show you a reference example of the code. So even if you get it wrong, you still have a chance to fix your mistakes.

It means:

- Deriving all the math (unlike Udemy courses)
- Deriving all the algorithms and pseudocode (unlike Udemy courses)
- Showing the implementation of the code (unlike academic courses)

Many beginners confuse the word “**depth**” with “**breadth**“.

For example, a beginner may come across a 20-40h course and proclaim, “This course covers so many topics! It is very in-depth!”

In actuality, it’s the opposite of deep. It’s shallow.

Spending 10 hours on basic Python, 1 hour on linear regression, 1 hour on logistic regression, etc. etc. means you only learned very basic stuff.

You didn’t learn anything “deep”. You only have superficial understanding.

Beginners often think that to be “in-depth”, a course must cover *many different algorithms*. That is the opposite of depth. That is called **breadth**.

Depth means going over the **details** of an algorithm. Specifically:

- Deriving each equation being used and the intuition behind them.
- Deriving any algorithms that must be put into code.
- Walking through the pseudocode.
- Implementing the actual code (pseudocode is nice, but it can’t run on your computer).

This is called “depth” because it goes over every nook and cranny of an algorithm or model, leaving no stone unturned.

“Breadth” is the opposite. This means covering many different topics in a **shallow fashion** (the word “shallow” is the opposite of “deep”).

Typically that involves what I described in Example 1 (very brief explanation, a few lines of code to call an API).

Breadth is useless. I can obtain breadth for free by reading the Scikit-Learn documentation.

The post What Does “In-Depth” Mean? appeared first on Lazy Programmer.

]]>