In this video, we’re going to do another coding interview question where I walk you through the process of:

– explaining the problem

– solving the problem

– analyzing the time complexity of the solution

These videos are designed to help you practice for future job interviews where you are required to answer technical whiteboard problems.

The problem we’ll solve can be viewed as a bioinformatics problem, but you don’t have to know anything about biology! Even those who only know basic coding should be able to solve this problem.

I hope this motivates you to realize that “bioinformatics” is not all that different from what you already know how to do – and therefore, contributing to this important and exciting field is totally within your reach.

Before we continue, just a gentle reminder that the VIP discount coupons for Financial Engineering and PyTorch: Deep Learning and Artificial Intelligence are expiring in just one week!

CLICK HERE to get 75% OFF “Financial Engineering and Artificial Intelligence in Python”
Topics covered:

Exploratory data analysis, significance testing, correlations

Alpha and beta

Time series analysis, simple moving average, exponentially-weighted moving average

Holt-Winters exponential smoothing model

ARIMA and SARIMA

Efficient Market Hypothesis

Random Walk Hypothesis

Time series forecasting (“stock price prediction”)

Modern portfolio theory

Efficient frontier / Markowitz bullet

Mean-variance optimization

Maximizing the Sharpe ratio

Convex optimization with Linear Programming and Quadratic Programming

Let’s begin by reviewing the regular TD methods covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state \( S_t \) and decides what action \( A_t \) to perform based on the state. This is called the policy and can be characterized by a probability distribution, \( \pi( A_t | S_t) \).

As the agent does this action, it changes the environment which results in the next state \( S_{t+1} \). A reward signal \( R_{t+1} \) is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, \( G_t \). The discounted return is defined as:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$

One important relationship we can infer from the definition of the return is:

$$ G_t = R_{t + 1} + \gamma G_{t+1} $$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

Taking the expected value of both sides and conditioning on the state \( S_t \) and action \( A_t \), we arrive at the Bellman Equation:

$$ Q_\pi(s, a) = E_\pi\left[ R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] $$

The Q-table can be used to determine what the best action will be since we can just choose whichever action \( a \) maximizes \( Q(s,a) \).

The problem is, we do not know \( Q(s,a) \)! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution \( p(s', r | s, a) \).

Generally speaking, this distribution is unknown. Imagine, for example, building a self-driving car: there is no way to write down the real world's transition probabilities.

The full Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$ Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a) $$

As you recall, it’s possible to convert the formula for the sample mean into a recursive update – something that looks a bit like gradient descent.

This is convenient so that we can update \(Q\) each time we receive a new sample without having to sum up all \(N\) values each time. If we did that, our computations would get longer and longer as \(N\) grows!

The recursive update looks like this:

$$ Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ G^{(N)}(s,a) - Q_\pi^{(N-1)}(s, a) \right] $$

This shows us how to get the \(N\)th estimate from the \(N-1\)th estimate. By setting \( \alpha = \frac{1}{N} \), we get exactly the sample mean (this was derived in the course).
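As a quick sketch of this update (function and variable names here are illustrative, not from the course code), setting \( \alpha = 1/N \) reproduces the sample mean exactly:

```python
# Minimal sketch of the recursive sample-mean update (names are illustrative).
# With alpha = 1/N, this reproduces the exact sample mean.
def update_mean(q_old, g_new, n):
    alpha = 1.0 / n
    return q_old + alpha * (g_new - q_old)

q = 0.0
for i, g in enumerate([2.0, 4.0, 6.0], start=1):
    q = update_mean(q, g, i)
# q is now the sample mean of [2.0, 4.0, 6.0]
```

Each new sample costs O(1) to incorporate, no matter how large \( N \) grows.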

The Temporal Difference approach uses “bootstrapping”, where we use existing values in \(Q_\pi(s, a)\) to estimate the expected return.

The SARSA update looks as follows:

$$ Q_\pi^{(N)}(s, a) \leftarrow Q_\pi^{(N-1)}(s, a) + \alpha \left[ r + \gamma Q_\pi^{(N-1)}(s', a') - Q_\pi^{(N-1)}(s, a) \right] $$

Essentially, we have replaced the old “target”:

$$ G(s,a) $$

with a new “target”:

$$ r + \gamma Q(s', a') $$

This should remind you of the right-hand side of the Bellman Equation for Q, with all the expected values removed.

What is the significance of this?

With Monte Carlo, we would have to wait until the entire episode is complete in order to compute the return (since the return is the sum of all future rewards).

For very long episodes, this means that it will take a long time before any updates are made to your agent.

For infinitely long episodes, it’s not possible to compute the return at all! (You’d have to wait an infinite amount of time to get an answer.)

TD learning uses the immediate reward \( r \) and “estimates the rest” using the existing value of \( Q \).

Therefore, your agent learns on each step, rather than waiting until the end of an episode.

The Q-Learning “target” is:

$$ r + \gamma \max_{a'} Q(s', a') $$

The main difference between SARSA and Q-Learning is that SARSA is on-policy while Q-Learning is off-policy.

This is because Q-Learning always uses the max, even when that action wasn’t taken. SARSA uses action \( a' \) – whichever action was actually chosen for the next step. Q-Learning learns the value of the policy in which you take the max (even though your behavior policy may be different – e.g. epsilon-greedy). SARSA learns the value of the behavior policy.
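To make the contrast concrete, here is a minimal tabular sketch of both updates (all names, shapes, and hyperparameter values are illustrative, not from any particular codebase):

```python
import numpy as np

# Illustrative tabular updates for a Q-table of shape (n_states, n_actions).
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[s2, a2]      # bootstrap with the action actually taken
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[s2].max()    # bootstrap with the greedy (max) action
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.zeros((2, 2))
sarsa_update(Q, 0, 0, r=1.0, s2=1, a2=0)
q_learning_update(Q, 0, 0, r=1.0, s2=1)
```

The only difference is the target: SARSA needs the next action `a2`, while Q-Learning takes the max over next actions regardless of what was done.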

Expected SARSA

For Expected SARSA, the target is:

$$ r + \gamma \sum_{a'} \pi( a' | s') Q(s', a') $$

This can also be written as:

$$ r + \gamma E_\pi [Q(s', a') | s'] $$

Thus, Expected SARSA uses the expected action-value for the next state \( s’ \), over all actions \( a’ \), weighted according to the policy distribution.

Like regular SARSA, this should remind you of the Bellman Equation for Q (even more so than regular SARSA since it now properly sums over the policy distribution).

You can think of regular SARSA as merely drawing samples from the Expected SARSA target distribution. Put a different way, SARSA does what Expected SARSA does, in expectation.
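A minimal sketch of the Expected SARSA update (names like `pi_probs` are illustrative; it holds the target policy's probabilities over actions in the next state):

```python
import numpy as np

# Sketch of the Expected SARSA update on a tabular Q.
def expected_sarsa_update(Q, pi_probs, s, a, r, s2, alpha=0.1, gamma=0.99):
    expected_q = np.dot(pi_probs, Q[s2])    # sum over a' of pi(a'|s') Q(s',a')
    target = r + gamma * expected_q
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.array([[0.0, 0.0], [1.0, 3.0]])
expected_sarsa_update(Q, pi_probs=np.array([0.5, 0.5]), s=0, a=0, r=1.0, s2=1)
```

Note that if `pi_probs` puts all of its mass on the argmax action, the target reduces to the Q-Learning target.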

So how does Expected SARSA unify SARSA and Q-Learning?

In fact, Q-Learning is merely a special case of Expected SARSA.

First, recognize that Expected SARSA can be on-policy or off-policy. In the off-policy case, one could act according to some behavior policy \( b(a | s) \), while the target is computed according to the target policy \( \pi( a | s) \).

If \( b(a | s) \) is epsilon-greedy while \( \pi(a | s) \) is greedy, then Expected SARSA is equivalent to Q-Learning.

On the other hand, Expected SARSA generalizes SARSA as well, as discussed above.

Practical Considerations

1) Aside from all the theory above, the update itself is actually quite simple to implement. In some cases, you may find that it works better than SARSA or Q-Learning. Thus, it is a worthy addition to your RL toolbox and could form a part of your strategy for building agents to solve RL problems.

2) It’s slower than SARSA and Q-Learning since it requires summing over the action space on each update. So, you may have to decide whether the additional computation requirements are worth whatever increase in performance you observe.

In this post, we’ll extend our toolset for Reinforcement Learning by considering the Monte Carlo method with importance sampling.

In my course, “Artificial Intelligence: Reinforcement Learning in Python“, you learn about the Monte Carlo method. But that’s just the beginning. There is still more that can be done to improve the agent’s learning capabilities.


Let’s begin by reviewing the regular Monte Carlo method covered in my Reinforcement Learning course.

Your job in a reinforcement learning task is to program an agent (characterized by a policy) that interacts with an environment (characterized by state transition dynamics). A picture of this process (more precisely, this article discusses a Markov Decision Process) is shown below:

The agent reads in a state \( S_t \) and decides what action \( A_t \) to perform based on the state. This is called the policy and can be characterized by a probability distribution, \( \pi( A_t | S_t) \).

As the agent does this action, it changes the environment which results in the next state \( S_{t+1} \). A reward signal \( R_{t+1} \) is also given to the agent.

The goal of an agent is to maximize its sum of future rewards, called the return, \( G_t \). The discounted return is defined as:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$

Since both the policy and environment transitions can be random, the return can also be random. Because of this, we can’t maximize “the” return (since there are many possible values the return can ultimately be), but only the expected return.

The expected return given that the agent is in state \( S_t \) and performs action \( A_t \) at time \( t \) is given by the Q-table. Specifically:

$$ Q_\pi(s, a) = E_\pi[ G_t \mid S_t = s, A_t = a ] $$

The Q-table can be used to determine what the best action will be since we can just choose whichever action \( a \) maximizes \( Q(s,a) \).

The problem is, we do not know \( Q(s,a) \)! Furthermore, we cannot calculate it directly since the expected value requires summing over the transition distribution \( p(s', r | s, a) \).

Generally speaking, this distribution is unknown. Imagine, for example, building a self-driving car: there is no way to write down the real world's transition probabilities.

The Monte Carlo approach is to estimate the action-value using the sample average. i.e.

$$ Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} G^{(i)}(s,a) $$

Where \( G^{(i)}(s,a) \) was the sample return when the agent was in state \( s \) and performed action \( a \) during the \( i \)’th episode.

Put simply: play a bunch of episodes, collect all the state-action-reward sequences, calculate the returns (by summing up the rewards), and then compute the average return for each state-action pair.
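The steps above can be sketched in a few lines (the episode format and names are illustrative; this version is every-visit Monte Carlo):

```python
from collections import defaultdict

# Every-visit Monte Carlo sketch. An episode is a list of
# (state, action, reward) tuples, in time order.
def update_q_from_episode(episode, returns, Q, gamma=0.99):
    g = 0.0
    for s, a, r in reversed(episode):
        g = r + gamma * g                  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[(s, a)].append(g)
        Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])

returns = defaultdict(list)
Q = defaultdict(float)
update_q_from_episode([("s0", "a0", 0.0), ("s1", "a0", 1.0)], returns, Q, gamma=0.9)
```

Looping backwards lets us accumulate each return in O(1) per step instead of re-summing future rewards at every time index.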

Easy!

How can we ensure that we visit every state-action pair so that the whole Q-table is filled up with a sufficient number of samples?

Practically, we usually employ some kind of exploration strategy, such as epsilon-greedy. With epsilon-greedy, we perform the optimal action with probability \( 1-\varepsilon \), and we pick a random action with probability \( \varepsilon \). So \( \varepsilon \) is the probability of exploration.

The problem with this approach is that it leads to a suboptimal policy. Why? It means that approximately \( \varepsilon \) of the time, you are going to do something suboptimal!

(Note: it’s not exactly \( \varepsilon \) since choosing an action randomly can still lead to choosing the optimal action by chance.)
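As a small sketch (names are illustrative), epsilon-greedy selection over one row of the Q-table looks like this:

```python
import numpy as np

# Sketch of epsilon-greedy action selection over one row of the Q-table.
def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:                  # explore with probability epsilon
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))             # otherwise act greedily

rng = np.random.default_rng(0)
greedy_action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the agent is purely greedy; raising epsilon trades exploitation for exploration.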

This is where we transition to the main topic of this article, which is how Importance Sampling can help us overcome this problem.

Monte Carlo with epsilon-greedy exploration is called an on-policy control method, because the action-value (Q-table) being estimated corresponds to the policy that the agent is following.

On the other hand, off-policy methods allow the agent to act according to one policy (called the behavior policy), while the action-value is computed for a different, eventually optimal policy (called the target policy).

Henceforth we will denote the target policy as \( \pi( a | s) \) and the behavior policy as \( b(a | s) \).

Importance Sampling

Suppose that we would like to estimate the expected value of some function \( s \) of a random variable \( X \) under some distribution \( \pi \). We write this as:

$$ E_\pi[ s(X) ] $$

If we know \( \pi \), then (assuming \( X \) is a discrete random variable) this can be computed as:

$$ E_\pi[ s(X) ] = \sum_x \pi(x)s(x) $$

If we don’t know \( \pi \), but \( X \) is sampled according to \( \pi \), then this expectation can be estimated by using the sample mean:

$$ E_\pi[ s(X) ] \approx \frac{1}{N} \sum_{i=1}^{N} s(X_i), \quad X_i \sim \pi $$

Now suppose that something prevents us from gathering samples according to the distribution \( \pi \), but it’s still possible to gather samples from a different distribution \( b \). i.e. We want \( X \sim \pi \) but we have \( X \sim b \) instead.

In this case, the above sample average does not give us the desired expectation. We would be estimating \( E_b[ s(X)] \) instead of \( E_\pi[ s(X) ] \).

The importance sampling solution is found by recognizing the following equalities:

$$ E_\pi[ s(X) ] = \sum_x \pi(x) s(x) = \sum_x b(x) \frac{\pi(x)}{b(x)} s(x) = E_b\left[ \frac{\pi(X)}{b(X)} s(X) \right] $$

This tells us that it’s possible to estimate the expectation under \( \pi \) even when the samples are drawn according to \( b \). All we have to do is multiply by the importance sampling ratio, \( \frac{\pi(X)}{b(X)} \).

The only requirement is that \( b(x) \) is not 0 when \( \pi(x) \) is not 0. This is called “coverage”.
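A quick numerical sketch of this idea (the distributions and the function here are made up purely for illustration):

```python
import numpy as np

# Numerical sketch of importance sampling. We want E_pi[s(X)]
# but can only sample X from the behavior distribution b.
rng = np.random.default_rng(42)
x_vals = np.array([0, 1, 2])
pi = np.array([0.1, 0.1, 0.8])    # target distribution
b = np.array([0.25, 0.25, 0.5])   # behavior distribution (b > 0 wherever pi > 0)

def s_fn(x):
    return x ** 2                 # the function whose expectation we want

samples = rng.choice(x_vals, size=100_000, p=b)
estimate = np.mean(pi[samples] / b[samples] * s_fn(samples))
exact = np.dot(pi, s_fn(x_vals))  # 0.1*0 + 0.1*1 + 0.8*4 = 3.3
```

Despite sampling from \( b \), the ratio-weighted average converges to the expectation under \( \pi \).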

Applying Importance Sampling to Reinforcement Learning

In reinforcement learning, the return \( G_t \) is generated by acting according to the behavior policy \( b(a | s) \) with transition dynamics \( p(s', r | s, a) \). But we would like to know the expectation of \( G_t \) under the target policy \( \pi(a | s) \) with the same transition dynamics.

In this case, the importance sampling ratio is a bit more complicated but can still be derived. \( G_t \) is a sample from the distribution \( p(A_t, S_{t+1}, A_{t+1}, \dots, S_T | S_t, A_\tau \sim b \mspace{5mu} \forall \tau) \).

Basically this says: “the distribution of all the actions and states that happened after arriving in state \( S_t \), following the policy \( b \)”.

The distribution we want the expectation with respect to is the same thing, but with actions drawn according to \( \pi \). i.e. \( p(A_t, S_{t+1}, A_{t+1}, \dots, S_T | S_t, A_\tau \sim \pi \mspace{5mu} \forall \tau) \).

Thanks to the Markov property these distributions are easy to expand.

The transition dynamics cancel out because they are the same on both top and bottom.

This is convenient, because we know \( \pi \) and we know \( b \), but we do not know \( p \) (which is why we have to use Monte Carlo in the first place).

Let’s define this importance sampling ratio using the symbol \( \rho \):

$$ \rho_t = \prod_{\tau=t}^{T-1} \frac{\pi(A_\tau | S_\tau)}{b(A_\tau | S_\tau)} $$

Using this importance sampling ratio, we can estimate \( Q_\pi(s,a) \) even though we are acting according to a different policy \( b \) and using the returns generated from this other policy:

$$ Q_\pi(s, a) \approx \frac{1}{N}\sum_{i=1}^{N} \rho^{(i)}(s,a) \, G^{(i)}(s,a) $$

Where \( G^{(i)}(s,a) \) was the sample return when the agent was in state \( s \) and performed action \( a \) during the \( i \)’th episode, and \( \rho^{(i)}(s,a) \) was the corresponding importance sampling ratio.

Importance Sampling for Monte Carlo Implementation

At this point, you know all the theory. All that you have to do now is plug in the above importance sampling ratio in the appropriate places in your existing Monte Carlo code, and you’ll be doing Monte Carlo with importance sampling.

Here are some important considerations.

Like the return \( G_t \), the importance sampling ratio is defined in terms of future values. i.e. the importance sampling ratio at time \( t \) depends on the probabilities of the behavior and target policies at time \( t+1, t+2, … \).

Therefore, it would make sense to loop backwards in time to compute this ratio, just like we loop backwards in time to compute the return.

Just like the return, the importance sampling ratio can be computed recursively:

$$ \rho_t = \frac{\pi(A_t | S_t)}{b(A_t | S_t)} \rho_{t+1} $$
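A sketch of the backwards loop, computing the return and the ratio together (the episode format and all names are illustrative; `pi_prob`/`b_prob` are the probabilities the target and behavior policies assigned to the action actually taken):

```python
# Compute the return G_t and importance sampling ratio rho_t for every
# step of one episode, looping backwards in time.
def returns_and_ratios(episode, gamma=0.99):
    g, rho = 0.0, 1.0
    out = []
    for s, a, r, pi_prob, b_prob in reversed(episode):
        g = r + gamma * g            # G_t = R_{t+1} + gamma * G_{t+1}
        rho *= pi_prob / b_prob      # rho_t = (pi/b) * rho_{t+1}
        out.append((s, a, g, rho))
    return list(reversed(out))

episode = [("s0", "a0", 1.0, 0.5, 0.25), ("s1", "a1", 2.0, 1.0, 0.5)]
out = returns_and_ratios(episode, gamma=1.0)
```

Each `(s, a, g, rho)` tuple can then be fed into the sample-average (or constant-time) Q update.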

Finally, you’ll recall that for the regular unweighted sample mean, it’s possible to perform constant-time updates every time we collect a new sample, instead of naively summing up all the samples and dividing by N.

As an exercise, try to derive what \( \alpha^{(i)} \) should be.

Last point: I haven’t discussed weighted importance sampling, which can be used to reduce the variance of the estimate. The weighted importance sampling estimate looks like this:

$$ Q_\pi(s, a) \approx \frac{\sum_{i=1}^{N} \rho^{(i)}(s,a) \, G^{(i)}(s,a)}{\sum_{i=1}^{N} \rho^{(i)}(s,a)} $$

Let’s review why this is different from regular Monte Carlo.

Regular Monte Carlo (what I covered in my Reinforcement Learning course) is an on-policy control method.

We use epsilon-greedy because exploration is required to collect enough samples for all state-action pairs. Epsilon-greedy is suboptimal by definition. Our Q-table and final policy will thus be suboptimal.

What might be better is an off-policy control method, where we act according to a behavior policy which allows for exploration, but compute the Q-table according to the target greedy policy (the optimal policy).

Just the other day, I came across a forum post discussing why job interviews for ML engineers and data scientists often involve algorithms and coding. “I don’t want to be a software engineer!” you say.

Except, this is reality. And if you want a job, you have to face this reality.

In this free video, I will show you many of the insightful comments posted in the discussion, and we are going to complete an example LeetCode interview question from beginning to end. I’m also going to show you a typical answer given by someone who doesn’t know algorithms and why that answer is wrong.

First, now is your chance to VOTE to tell me what you want to see in my next course. Transformers? Time Series Analysis? More advanced GANs? More advanced Reinforcement Learning? Let me know in this survey (it’s anonymous):

Second, check out the latest episode of The Lazy Programmer Show, where I discuss why bad programmers are always trying to get the latest version of some language or library, and why they tend to freak out when things are not in whatever version they happen to be using.

I look at this topic from 3 different perspectives, including:

1) What is it like in the “real world” working at a “real job”?

2) What kind of skills should a typical (competent) programmer have?

3) What are students learning and how do they approach machine learning and coding in colleges? Remember those students become “new grads” and those “new grads” become “junior engineers”.

What would your boss say if a junior engineer could run circles around you, a so-called “professional”?

In the latest episode of the Lazy Programmer Show, I give you my honest opinion of Tensorflow 2, one year after creating the leading Tensorflow 2 course on Udemy.

Is it still good?

Is it still worth learning?

Why does stuff keep breaking / changing?

How does it compare to other offerings? (PyTorch, JAX, Julia)

Did you know JAX was created by Google? (thus, using Tensorflow doesn’t equate to “doing what Google’s doing”)

Julia is a totally different programming language popular with many data scientists and machine learning engineers. Will it replace Python?

PLUS, a little bonus (but you’ll have to watch the video to see what it is) 😉

The following books can be used to study core computer science topics at the college / university level, to prepare yourself for machine learning, deep learning, artificial intelligence, and data science.

These are the books I recommend for building your own computer science degree. Remember! The goal is to do as many exercises as you can. It’s not to just watch 5 minute YouTube videos and then conclude “I understand everything! There’s no need for exercises!”

This quote from the video sums it up nicely: if you don’t find the problems, the problems will find you.

These books cover common core courses that are relevant for many sub-fields of Computer Science and Engineering, including Machine Learning, as well as related fields such as operations research, statistics, quantitative finance, software engineering, digital communications, wireless communications, control systems (e.g. autopilot), robotics, and many more.

To recap, these are the courses and why you want to take them:

Calculus

Nearly all machine learning algorithms boil down to optimization problems. What is optimization? Generally speaking, it’s when you have a function and you want to maximize or minimize that function.

If you’ve taken calculus, then you should recall that this is exactly what you learn how to do in calculus.

Therefore, calculus is an essential tool in artificial intelligence, data science, etc.

Linear Algebra

In machine learning and deep learning especially, we work with vectors, matrices, and higher-dimensional objects. This is the realm of linear algebra.

Luckily, you don’t have to go that far in linear algebra to get what you need for deep learning and AI.

For example, concepts such as spans, subspaces, rank, etc. rarely show up in machine learning.

On the other hand, the basics, such as matrix and vector multiplication and eigenvalues and eigenvectors, show up often.

Probability

Probability is the language you must speak if you want to do machine learning and AI.

Recall above that machine learning often boils down to an optimization. What are we trying to optimize? Often, it’s an expected value. What is an expected value? Well, you have to learn probability to find that out.

For a time, “probabilistic graphical models” were the state of the art in artificial intelligence. Clearly, probability would be a prerequisite.

Probability shows up nearly everywhere in machine learning, from soft k-means clustering to hidden Markov models to artificial neural networks and beyond.

Side note: if you were thinking earlier, “who needs calculus when Tensorflow can do it for me!?”, think again. Calculus is a prerequisite to probability. So if you want to learn probability, you still need calculus anyway.

Programming

Obviously, at some point, you need to be able to actually write a computer program in order to use machine learning.

Nowadays, things can seem very simple when all you need is 3 lines of boilerplate code to use scikit-learn.

However, that’s not really what one should imagine when they think of “learning machine learning”.

Check any college-level machine learning course (not that bullshit being sold by marketers online) to confirm what I am saying.

As the great physicist Richard Feynman once said, “What I cannot create, I do not understand”.

In order to best understand a machine learning algorithm, you should implement it.

No, you are not “reinventing the wheel”. This is called “learning”.

I would posit that if you can’t implement basic algorithms like k-means clustering, logistic regression, k-nearest neighbor, and naive Bayes, you do not understand those algorithms.

So why do I suggest Java over something like Python, which has easily become the most popular language for doing data science?

The problem with Python is that it’s too high level. It doesn’t make you think about the program you are writing at the algorithmic level.

You should understand the difference between an efficient and an inefficient algorithm. (No, that doesn’t mean memorizing facts like “Python list comprehensions are better than for loops”).

In fact, you should recognize that a list comprehension has exactly the same time complexity as the equivalent for loop.
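A quick illustration: both forms below perform \( n \) multiplications and build an \( n \)-element list, so both are \( O(n) \); the comprehension is syntactic convenience, not an asymptotic improvement.

```python
# Both of these are O(n): same number of multiplications, same list growth.
data = range(10)

squares_loop = []
for x in data:
    squares_loop.append(x * x)

squares_comp = [x * x for x in data]

assert squares_loop == squares_comp
```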

Java, being slightly lower level, forces you to think algorithmically.

That brings us to the final topic.

Algorithms and Data Structures

In order to really understand algorithms, you should study… algorithms.

There are many famous algorithms contained in the book I’ve suggested below.

Realistically, you are not going to use these in your day to day work (a very common complaint from software developers seeking employment).

However, that’s not really the point.

The point is exercising your brain and learning how to think in certain ways that help you write better code.

Also, you should understand the pros and cons of basic data structures such as lists, sets, dictionaries, and trees.

If you’re coding up some algorithm, why might a set be better than a list? Algorithms tell you why.

One major data structure you might want to learn about is graphs (along with their associated algorithms). Graph neural networks seem to be picking up steam, and they are being used for all kinds of interesting problems like social network analysis, chemistry, and more.

To summarize: the core courses I would consider essential for building your own Computer Science degree in preparation for machine learning, deep learning, data science, and artificial intelligence are calculus, linear algebra, probability, programming, and algorithms.

Don’t just watch a bunch of videos on YouTube or Khan Academy and then proclaim you understand the subject. The reason I’ve suggested books is because they contain exercises / homework problems. These are what you must be able to do in order to claim that you understand something. It’s not about “absorbing information”, it’s about “producing useful output”.

Common question: What about C++? Yes, C++ is excellent! Ideally, you will learn both C++ and Java, but obviously, these are not hard prerequisites for machine learning or data science.

Calculus: Early Transcendentals by James Stewart

Introduction to Linear Algebra by Gilbert Strang

Introduction to Probability by Bertsekas and Tsitsiklis

Big Java by Cay Horstmann

Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein

Disclaimer: this post contains Amazon affiliate links.

This lecture will answer a common beginner question: why does the lecture “how to succeed in this course” exist?

Note that in some courses, this may have been replaced with the alternate lecture, “anyone can succeed in this course”.

Every now and then, this lecture really offends someone (probably because it hits one or two of their “sensitive spots” and they know that they are making some of the mistakes I mention in that lecture). But I’d rather not speculate, and furthermore, there is no reason to.

Let’s get down to business:

Your interpretation of this lecture is wrong.

Your main mistake is that you haven’t thought of your fellow students (and there are thousands of them).

Obviously, this lecture wouldn’t have been made if it were not deemed necessary.

Common sense (hopefully) should tell you that these “strange statements” (yes, someone really called them that, haha!) weren’t invented out of thin air.

Clearly, based on this lecture, you can conclude that:

– there are students who don’t follow instructions and need a reminder

– there are students who don’t meet the prerequisites and need a reminder

– there are students who believe all the prerequisites can be included in this course (not true)

– there are students who believe they can understand this course without meeting the prerequisites (not true)

– there are students who make up self-defeating excuses not to use the Q&A

– there are students who don’t know that coding is an exercise

– there are students who don’t know about the speed changer

– there are students who forget things and don’t realize that taking notes would have prevented them from forgetting

– there are students who believe their PhD absolves them from following the instructions or meeting the prerequisites

– etc.

Need I say more?

Finally, understand that the course is not personalized for you. It’s for everybody.

Therefore, if you find a lecture that doesn’t contain content you want to learn, simply skip it.

It should be very easy. Let me know if you can’t find the “skip” button.

In fact, this is stated in this very lecture, so maybe you should have watched it more carefully. 😉

The lesson is: be kind, courteous, understanding, and accommodating to your fellow students. Hopefully, this makes sense.

Common assertions:

it is patronizing

It’s only patronizing if you believe you are “above” this advice, in which case, you should double check that the problem isn’t that you think too highly of yourself. 😉

it is negative

Is today backwards day?

“Anyone can succeed in this course” is a negative statement?

The opposite of that is “Nobody can succeed in this course”. I suppose that would be positive?

😂😂😂

Overall:

To date, nobody has been able to provide a direct rebuttal to any of the points I have mentioned above.

If you think you can rebut anything I have said, I am actually so eager to hear what you have to say. I have been waiting years for someone to tell me why any of the above points are wrong!

One of the most common questions I get from beginners in machine learning is, “how do I practice what I’ve learned?”

There are several ways to answer this.

First, let’s make an important distinction.

There’s a difference between putting in the work to understand an algorithm, and using that algorithm on data. We’ll call these the “learning phase” and the “application phase”.

Learning phase = Putting in the work to understand an algorithm
Application phase = Using that algorithm on data

Let’s take a simple example: linear regression.

In the learning phase, your tasks will include:

Being able to derive the algorithm from first principles (that’s calculus, linear algebra, and probability)

Implementing the algorithm in the language of your choice (it need not be Python)

Testing your algorithm on data to verify that it works

These are essential tasks in ensuring that you really understand an algorithm.

These tasks are “exercises” which improve your general aptitude in machine learning, and will strengthen your ability to learn other algorithms in the future, such as logistic regression, neural networks, etc.

As my famous motto goes: “if you can’t implement it, then you don’t understand it”.

Interestingly, 5 years after I invented this motto, I discovered that the famous physicist Richard Feynman said a very similar thing!

In order to get an infinite amount of practice in this area, you should learn about various extensions on this algorithm, such as L1 and L2 regularization, using gradient descent instead of the closed-form solution, 2nd order methods, etc.

You might want to try implementing it in a different language. And finally, you can spend a lifetime exercising your ability to understand machine learning algorithms by learning about more machine learning algorithms in much the same way.

Believe me, 10 years down the line you may discover something new and interesting about even the simplest models like Linear Regression.

The second phase is the “application phase”.

Here is where your ability to exercise and practice is really infinite.

Let’s first remember that I don’t know you personally. I don’t know what you care about, what field you are in, or what your motivations for learning this subject are.

Therefore, I cannot tell you where to apply what you’ve learned: only you know that.

For example, if you are a computational biologist, then you can use this algorithm on problems specific to computational biology.

If you are a financial engineer, then you can use this algorithm on problems specific to financial engineering.

Of course, because I am not a computational biologist, I don’t know what that data looks like, what the relevant features are, etc.

I can’t help you with that.

The “interface” where I end and you begin is the algorithm.

After I teach you how and why the algorithm works and how to implement it in code, using it to further scientific knowledge in your own field of study becomes your responsibility.

One can’t expect me to be an expert computational biologist and an expert financial engineer and whatever else it is that you are an expert in.

Therefore, you can’t rely on me to tell you what datasets you might be interested in, what kinds of problems you’re trying to solve, etc.

Presumably, since you’re the expert, you should know that yourself!

If you don’t, then you are probably not the expert you think you are.

But therein lies the key.

Once you’ve decided what you care about, you can start applying what you’ve learned to those datasets.

This will give you an infinite amount of practice, assuming you don’t run out of things to care about.

If you don’t care about anything, well then, why are you doing this in the first place? Lol.

This also ties nicely into another motto of mine: “all data is the same”.

What does this mean?

Let’s recall the basic “pattern” we see when we use scikit-learn:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, Y_train)
model.predict(X_test)

Does this code change depending on whether (X, Y) represent a biology dataset or a finance dataset?

The answer is no!

Otherwise, a general-purpose library like Scikit-Learn could not even exist!

“All data is the same” means that the same Linear Regression algorithm applies, no matter what field or what industry your data happens to come from.

There’s no such thing as “Linear Regression for biology” and “Linear Regression for finance”.

There’s only one linear regression that is the same linear regression no matter the dataset.

Thus, you learn the algorithm once, and you can apply it infinitely to any number of datasets!
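To make the point concrete, here is a small sketch in which the exact same three-line pattern is applied to two different datasets. The datasets are randomly generated stand-ins I invented for this example; the point is that the code path is identical no matter what domain the arrays came from.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_and_predict(X_train, Y_train, X_test):
    # The same three lines, regardless of where the data came from
    model = LinearRegression()
    model.fit(X_train, Y_train)
    return model.predict(X_test)

# Two made-up datasets standing in for "biology" and "finance" data
rng = np.random.default_rng(42)
X_bio, Y_bio = rng.normal(size=(50, 4)), rng.normal(size=50)
X_fin, Y_fin = rng.normal(size=(200, 10)), rng.normal(size=200)

# Identical code path for both
pred_bio = fit_and_predict(X_bio, Y_bio, X_bio[:5])
pred_fin = fit_and_predict(X_fin, Y_fin, X_fin[:5])
```

Only the shapes and semantics of the arrays differ; the algorithm, and therefore the code, stays the same.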

Pretty cool huh?

But look, if you really have zero idea of what you care about, or your answer is “I care about machine learning”, then there are plenty of standard datasets that you can look up on your own.

Sources include Kaggle, the UCI Machine Learning Repository, etc. There’s so much data out there that you will still have to pick and choose what to focus on first.

Again, you have to choose what you care about. Nobody else can possibly tell you that with any accuracy.

Addendum:

The “learning phase” above does not apply to situations where you’re learning an API (for example, TensorFlow 2, PyTorch, or even Scikit-Learn).

Why?

Well firstly, there’s nothing really to derive.

Secondly, it would be impossible for you to implement anything yourself without me showing you how first (at which point anything you type would amount to simply copying what I did).

Why?

Well, how would you know what to type if I didn’t show you?

Are you going to magically come up with the correct syntax for a library that you simply haven’t learned?

Obviously not. That would be a ludicrous idea.

In this case, the “learning phase” amounts to:

Understanding the syntax I’ve shown you

Being able to replicate that syntax on your own

This most closely represents what you will do in the real world. In the real world, you want to be able to write code fast and efficiently, rather than trying to remember which course and which lecture covered that exact syntax you’re thinking of.

Being able to write code on the fly makes you efficient and fast. Obviously, I don’t remember everything, but I know where to look when I need something. You have to find the right balance between memorizing and looking things up, just as computer programs keep frequently used data in caches and RAM rather than fetching it from the hard drive.

Obviously, this will get better over time as you practice more and more.

It sounds overly simplistic, but it’s nothing more than repetition and muscle memory. I usually don’t explicitly commit to memorizing anything. I just write code and let it come naturally. The key is: I write code.

Obviously, watching me play tennis or reading books about tennis will not make you a better tennis player.

You must get on the court, pick up the tennis racket, and play actual tennis matches!

I can’t force you to do this. It’s the kind of thing which must be done of your own volition.

I mean, if you want to pay me consulting hours to call you and remind you to practice, I’d be very happy to oblige. =)

Once you have completed the “learning phase” of learning an API, the “application phase” described above still applies.