How to Speak by Patrick Winston

May 30, 2022

Making a post on this for posterity. A student sent this to me the other day and I thought it was great.

Wish I had seen this when I was a grad student.

If you are in undergrad or thinking about going to grad school, definitely watch this video.

I could probably apply some of this to my courses too!


Go to comments

[NEW COURSE] Data Science: Transformers for Natural Language Processing

May 25, 2022

Data Science: Transformers for Natural Language Processing

VIP Promotion

The complete Transformers course has arrived

Hello friends!

Welcome to my latest course, Transformers for Natural Language Processing (NLP).

Don’t want to read my little spiel? Just click here to get the VIP discount:

Link 1)

Link 2) (expires in 30 days – June 25, 2022!)

(expires July 26, 2022)

Transformers have changed deep learning immensely.

They’ve massively improved the state-of-the-art in all NLP tasks, like sentiment analysis, machine translation, question-answering, etc.

They’re even expanding their influence into other fields, such as computational biology and computer vision. DeepMind’s AlphaFold 2 has been said to “solve” a longstanding problem in molecular biology, known as protein structure prediction. Recently, DALL-E 2 demonstrated the ability to generate amazing art and photo-realistic images based only on simple text prompts. Imagine that – creating a realistic image out of just an idea!

Just within the past week, DeepMind introduced “Gato“, which is what they call a “generalist agent”, an AI that can do multiple things, like chat (i.e. do NLP!), play Atari games, caption images (i.e. computer vision!), manipulate a real, physical robot arm to stack blocks, and more!

Gato does all this by converting all the usual inputs from other domains into a sequence of tokens, so that they can be processed just like how we do in NLP. This is a great example of my oft-repeated rule, “all data is the same” (and also, another great reason to learn NLP since it would be a prerequisite to understanding this).


The course is split into 3 major parts:

  1. Using Transformers (Beginner)
  2. Fine-Tuning Transformers (Intermediate)
  3. Transformers In-Depth (Expert – VIP only)


In part 1, you will learn how to use transformers which were trained for you. This costs millions of dollars to do, so it’s not something you want to try by yourself!

We’ll see how these prebuilt models can already be used for a wide array of tasks, including:

  • text classification (e.g. spam detection, sentiment analysis, document categorization)
  • named entity recognition
  • text summarization
  • machine translation
  • question-answering
  • generating (believable) text
  • masked language modeling (article spinning)
  • zero-shot classification

This is already very practical.

If you need to do sentiment analysis, document categorization, entity recognition, translation, summarization, etc. on documents at your workplace or for your clients – you already have the most powerful state-of-the-art models at your fingertips with very few lines of code.

One of the most amazing applications is “zero-shot classification”, where you will observe that a pretrained model can categorize your documents, even without any training at all.


In part 2, you will learn how to improve the performance of transformers on your own custom datasets. By using “transfer learning”, you can leverage the millions of dollars of training that have already gone into making transformers work very well.

You’ll see that you can fine-tune a transformer for many of the above tasks with relatively little work (and little cost).


In part 3 (the VIP sections), you will learn how transformers really work. The previous sections are nice, but a little too nice. Libraries are OK for people who just want to get the job done, but they don’t work if you want to do anything new or interesting.

Let’s be clear: this is very practical.

How practical, you might ask?

Well, this is where the big bucks are.

Those who have a deep understanding of these models and can do things no one has ever done before are in a position to command higher salaries and prestigious titles. Machine learning is a competitive field, and a deep understanding of how things work can be the edge you need to come out on top.

We’ll also look at how to implement transformers from scratch.

As the great Richard Feynman once said, “what I cannot create, I do not understand”.



  • As usual, I wanted to get this course into your hands as early as possible! There are a few sections and lectures still in the works, including (but not limited to): fine-tuning for question-answering, more theory about transformers, and implementing transformers from scratch. As usual, I will update this post as new lectures are released.
  • Everyone makes mistakes (including me)! Because this is such a large course, if I forgot anything (e.g. a Github link), just email me and let me know.
  • Due to the way Udemy now works, if you purchase the course on, I cannot give you access to the Udemy version. It hasn’t always been this way, and Udemy has tended to make changes over the years that negatively impact both me and you, unfortunately.
  • If you don’t know how “VIP courses” work, check out my post on that here. Short version: always houses all the content (both VIP and non-VIP). Udemy will house all the content initially, but the VIP content is removed later on.

So what are you waiting for? Get the VIP version of Transformers for Natural Language Processing NOW:

Go to comments

Become a Millionaire by Taking my Financial Engineering Course

May 17, 2022

I just got an excellent question today about my Financial Engineering course, which allowed me to put into words many thoughts and ideas I’d been pondering recently.

Through this post, I hope to get all these ideas into one place for future reference.


The question was: “How practical is this course? I’ve skimmed through several top ratings on Udemy but have yet seen one boasting how much money the student made after taking it

Will you become a millionaire after taking my financial engineering course?


Let’s answer this question by starting with my own definition of “practical”, and then subsequently addressing the student’s definition of practical which appears to mean “making money”.

In my view, “practical” simply means you’re applying knowledge to a real-world dataset.

For example, my Recommender Systems course is practical because you apply the algorithms we learn to real-world ratings datasets.

My Bayesian Machine Learning: A/B Testing course is practical because you can apply the algorithms to any business scenario where you have to decide between multiple choices based on some numerical objective (e.g. clicks, page view time, etc.)

In the same way, the Financial Engineering course is extremely practical, because the whole course is about applying algorithms to real-world financial datasets. The application is a real-world problem.

This is unlike, say, reading Pattern Recognition and Machine Learning by Bishop, which is all about the algorithms and not the fields of application. The implication is that, you know what you’re doing and can take those algorithms and apply them to your own data.

On one hand, that’s powerful – because you can apply these algorithms to any field (like biology, astronomy, chemistry, robotics, control systems, and yes, finance), but at the same time, you have to be pretty smart to do it. The average Udemy student would struggle.

In that sense, this is the most practical you can get. Everything you learn in this course is being directly applied to real-world data in a specific field (finance).

You can grab one of the algorithms taught in the course and start using it today on your own investing account. There’s a lecture about that in the Summary section called “Applying This Course” for those who need extra help.

Importantly, do keep in mind that while I can teach you what to do, I can’t actually make you do it.

In A/B Testing, I can show you the code, but the rest is up to the student to make it practical, by actually getting a job where they get to do that in a production system, or by inserting the code into their own production website so they can feed it to live users.

Funny enough, A/B Testing isn’t even about finance nor money. But will you make money with those techniques? YES. Amazon, Facebook, Netflix, etc. are already using the same techniques with great success.

The only reason some students might say it’s not practical is because they are too lazy/incompetent to get off their butts and actually do it!

Same here. I can teach the algorithms, but I can’t go into your brokerage account and run them for you.


Now let’s consider the definition of “practical” in the sense of being guaranteed to “make money”.

This is a common concern among students who are new to finance and don’t really know yet what to expect.

Let’s suppose I could guarantee that by taking this course, you could make money.

Consider some obvious questions:

  • If this were true, anyone (including myself) would just scale it up and become extremely wealthy without doing any work. Clearly, no such thing exists (that is public and that we know of).
  • If this were true, why would anyone work? Financial engineering graduates wouldn’t bother to apply for jobs, they would just run algorithms all day. They would teach their friends / family to do the same. No one would ever bother to get a job.
  • If this were true, why would hedge funds bother to hire employees? After inventing an algorithm, they could just run it forever. What’s the point of wasting money to hire humans? What would they even do?
  • If this were true, why would hedge funds bother to hire PhDs and why would people bother to get PhDs? Imagine you could increase your investments infinitely from a 20 hour online course. What kind of insane person would work for 4-7 years just to get a pittance and a paper that says “PhD”?

On the contrary, the reality is this.

The financial sector does hire very smart people and it is well-known that they have poor work-life balance.

They must be working hard. What are they doing?

Why can’t they just learn an algorithm and sit back and relax?


Instead, let’s expand the definition of “practical”.

Originally, this question was asked in a comment on a video I made about predicting stock prices with LSTMs. Is this video practical? YES. If you didn’t know this, you could have spent weeks / months / maybe even your whole life trying to “predict stock prices with LSTMs”, with zero clue that it didn’t actually work. That would be sad.

Spending weeks or months doing something that doesn’t even make sense is what I would consider to be very impractical. And hence, learning how to avoid it would be very practical.

A lot of the course is about how to properly model and analyze. How to stay away from stupidity.

One of the major themes of the course is that “Santa Claus doesn’t exist”.

A naive person might think “there must be some way to predict the stock price, you are just not telling me about the most advanced algos!”

But the “Santa Claus doesn’t exist” moment is when we prove mathematically why certain predictions are impossible.

This is practical because it saves you from attempting something which doesn’t make any logical sense.

Obviously, it doesn’t fulfill the childhood dream of meeting Santa (predicting an unpredictable time series), but I would posit that trying to meet Santa is what is really impractical.

What is actually practical is learning how to determine whether you can or cannot predict a time series (at which point, you can then make your predictions as normal).

I’ll give you another example lesson.

If you used the simplest trading strategy from this course, you could have beat the market from 2000 – 2018.

Using the same algorithm, you would have underperformed the market from 2018 to now.

The practical lesson there is that “past performance doesn’t indicate future performance”.

This is how you can have a “practical” lesson, which doesn’t automatically imply “guaranteed rate of return” (which is impossible).

Addendum: actually, it is possible to guarantee a rate of return. Just purchase a fixed-income security like a CD (certificate of deposit) at your bank. The downside is that the rate of return is very low. This is yet another practical lesson from the course – the tradeoff between risk and reward and how real-world entities automatically adjusts themselves to match present conditions. In other words, you’ll never find a zero-risk asset that guarantees 1000x returns. Why is this practical? Again, you want to avoid wasting time searching for that which does not exist.

Go to comments

Machine Learning in Finance by Dixon, Halperin, Bilokon – A Critique

May 16, 2022

Check out the video version of this post on YouTube:


In this post, I’m going to write about one of my all-time favorite subjects: the wrong way to predict stock and cryptocurrency prices.

Despite the fact that I’ve discussed this many times before, I’m very excited about this one.

It’s not everyday I get to critique a published book by a big name like Springer.

The book I’m referring to is called “Machine Learning in Finance: From Theory to Practice”, by Matthew Dixon, Igor Halperin, and Paul Bilokon.

Now you might think I’m beating a dead horse with this video, which is kind of true.

I’ve already spoken at length about the many mistakes people make when trying to predict stock prices.

But there are a few key differences with this video.

Firstly, in past videos, I’ve mentioned that it is typically bloggers and marketers who put out this bad content.

This time, it’s not a blogger or marketer, but an Assistant Professor of Applied Math at the Illinois Institute of Technology.

Secondly, while I’ve spoken about what the mistakes are, I’ve never done a case study where I’ve broken down actual code that makes these mistakes.

This is the first.

Thirdly, in my opinion, this is the most important topic to cover for beginners to finance, because it’s always the first thing people try to do. They want to predict future prices so they know what to invest in today.

If you take my course on Financial Engineering, you’ll learn that this is completely untrue. Price prediction barely scratches the surface of true finance.


In order to get the code I’ve used in this video, please use this link:

Note that it’s a copy of the code provided with the textbook, with added code for my own experiments (computing the naive forecast and the corresponding train / test MSE).

I also removed code for a different type of RNN called the “alpha RNN”, which uses an old version of Keras. Removing this code doesn’t make a difference in our results because this model didn’t perform well.


The mistakes I’ll cover in this post are as follows.

1) They only standardize the price time series, which does nothing about the problem of extrapolation.

2) They never check whether their model can beat the naive forecast. Spoiler alert. I checked, and it doesn’t. The models they built are worse than useless.

3) Misleading train-test split.


So let’s talk about mistake #1, which is why standardizing a price time series does not work.

The problem with prices is that they are ever increasing. This wasn’t the case for the time period used in the textbook, but it is the case in general.

Why is this an issue?

The train set is always in the past, and the test set is always in the future.

Therefore, the values in the test set in general will be higher than the values in the train set.

If you build an autoregressive model based on this data, your model will have to extrapolate to a domain never seen before in the train set.

This is not good, because machine learning models suck at extrapolation.

How they extrapolate has more to do with the model itself, than it has to do with the data.

We analyzed this phenomena in my course on time series analysis.

For instance, decision trees tend to extrapolate by going horizontally outward.

Neural networks, Gaussian Processes, and other models all behave differently, and none of these behaviors are related to the data.


Mistake #2, which is the worst mistake, is that the authors never check against the naive forecast.

As you recall, the naive forecast is when your prediction is simply the past known value.

In their notebook, the authors predict 4 time steps ahead.

So effectively, our naive prediction is the price from 4 time steps in the past.

Even this very dumb prediction beats their fancy RNN models. Surprisingly, this happens not just for the test set, but the train set as well.


Mistake #3 is the misleading train-test split.

In the notebook, the authors make a plot of their models’ predictions against the true price.

Of course, the error looks very small and very close to the true price in all cases.

But remember that this is misleading. It doesn’t tell you that these models actually suck.

In time series analysis, when we think of a test set, we normally think of it as the forecast horizon.

Instead, the forecast horizon is actually 4 time steps, and the plot actually just shows the incremental predictions at each time step using true past data.

To be clear, although this is not a forecast, it’s also not technically wrong, but it’s still misleading and totally useless for evaluating the efficacy of these models.

As we saw from mistake #2, even just the naive forecast beats these models, which you wouldn’t know from these seemingly good plots.


So I hope this post serves as a good lesson that you always have to be careful about how you apply machine learning in finance.

Even big name publishers like Springer, and reputable authors who might even be college professors, are not immune to these mistakes.

Don’t trust everything you see, and always experiment and stress test any claims.

Go to comments

FREE Exercise: Predict Stocks with News, + Other ML News

January 19, 2022

TL;DR: this is an article about how to predict stocks using the news.

In this article, we are going to do an exercise involving my 2 current favorite subjects: natural language processing and financial engineering!

I’ll present this as an exercise / tutorial, so hopefully you can follow along on your own.

One comment I frequently make about predicting stocks is that autoregressive time series models aren’t really a great idea.

Basic analysis (e.g. ACF, PACF) shows no serial correlation in returns (that is, there’s no correlation between past and future) and hence, the future is not predictable from the past.

The best-fitting ARIMA model is more often than not, a simple random walk.

What is a random walk? If you haven’t yet learned this from me, then basically think of it like flipping a coin at each time step. The result of the coin flip tells you which way to walk: up the street or down the street.

Just as you can’t predict the result of a coin flip from past coin flips (by the way, this is essentially the gambler’s fallacy!), so too is it impossible to predict the next step of a random walk.

In these situations, the best prediction is simply the last-known value.

This is why, when one tries to fit an LSTM to a stock price time series, all it ends up doing is predicting close to the previous value.

There is a nice quote which is unfortunately (as far as I know) unattributed, that says something like: “trying to predict the future from the past is like trying to drive by looking through the rearview mirror”.

Anyway, this brings us to the question: “If I don’t use past prices, then what do I use?”

One common approach is to use the news.

We’ve all seen that news and notable events can have an impact on stock / cryptocurrency prices. Examples:

  • The Omicron variant of COVID-19
  • High inflation
  • Supply-chain issues
  • Elon Musk tweeting about Dogecoin
  • Mark Zuckerberg being grilled by the government

Luckily, I’m not going to make you scrape the web to download news yourself.

Instead, we’re going to use a pre-built dataset, which you can get at:

Briefly, you’ll want to look at the “combined” CSV file which has the following columns:

  • Date (e.g. 2008-08-11 – daily data)
  • Label (0 or 1 – whether or not the DJIA went up or down)
  • Top1, Top2, …, Top25 (news in the form of text, retrieved from the top 25 Reddit news posts)

Note that this is a binary classification problem.

Thanks to my famous rule, “all data is the same“, your code should be no different than a simple sentiment analysis / spam detection script.

To start you off, I’ll present some basic starter code / tips.


Tip 1) Some text contains weird formatting, e.g.

b”Georgia ‘downs two Russian warplanes’ as cou…

Basically, it looks like how a binary string would be printed out, but the “b” is part of the actual string.

Here’s a simple way to remove unwanted characters:


Tip 2) Don’t forget that this is time-ordered data, so you don’t want to do a train-test split with shuffling (mixing future and past in the train and test sets). The train set should only contain data that comes before the test set.


Tip 3) A simple way to form feature vectors from the news would be to just concatenate all 25 news columns into a single text, and then apply TF-IDF. E.g.

I’ll leave the concatenation part as an exercise for you.


Here are some extra thoughts to consider:

  • How were the labels created? Does that method make sense? Is it based on close-close or open-close?
  • What were the exact times that the news was posted? Was there sufficient time between the latest news post and the result from which the label is computed?
  • Returns tend to be very noisy. If you’re getting something like 85% test accuracy, you should be very suspicious that you’ve done something wrong. A more realistic result would be around 50-60%. Even 60% would be considered suspiciously high.


So that’s basically the exercise. It is simple, yet hopefully thought-provoking.


Now I didn’t know where else to put this ML news I found recently, but I enjoyed it so I want to share it with you all.

First up: “Chatbots: Still Dumb After All These Years

I enjoyed this article because I get a lot of requests to cover Chatbots.

Unfortunately, Chatbot technology isn’t very good.

Previously, we used seq2seq (and also seq2seq with attention) which basically just learns to copy canned responses to various inputs. seq2seq means “sequence to sequence” so the input is a sequence (a prompt) and the target/output is a sequence (the chatbot’s response).

Even with Transformers, the best results are still lacking.


Next: “PyTorch vs TensorFlow in 2022

Wait, people are still talking about this in 2022? You betcha!

Read this article. It says a lot of the same stuff I’ve been saying myself. But it’s nice to hear it from someone else.

It also provides actual metrics which I am too lazy to do.


Finally: “Facebook’s advice to students interested in artificial intelligence

This isn’t really “new news” (in fact, Facebook isn’t even called Facebook anymore) but I recently came across this old article I saved many years earlier.

Probably the most common beginner question I get is “why do I need to do all this math?” (in my ML courses).

You’ve heard the arguments from me hundreds of times.

Perhaps you are hesitant to listen to me. That would be like listening to your parents. Yuck.

Instead, why not listen to Yann LeCun? Remember that guy? The guy who invented CNNs?

He’s the Chief AI Scientist at Facebook (Meta) now, so if you want a job there, you should probably listen to his advice…

And if you think Google, Netflix, Amazon, Microsoft, etc. are any different, well, that is wishful thinking my friends.

What do you think?

Is this convincing? Or is Yann LeCun just as wrong as I am?

Let me know!

Go to comments

Convert a Time Series Into an Image with Gramian Angular Fields and Markov Transition Fields

August 30, 2021

In my latest course (Time Series Analysis), I made subtle hints in the section on Convolutional Neural Networks that instead of using 1-D convolutions on 1-D time series, it is possible to convert a time series into an image and use 2-D convolutions instead.

CNNs with 2-D convolutions are the “typical” kind of neural network used in deep learning, which normally are used on images (e.g. ImageNet, object detection, segmentation, medical imaging and diagnosis, etc.)

In this article, we will look at 2 ways to convert a time series into an image:

  1. Gramian Angular Field
  2. Markov Transition Field



Gramian Angular Field


The Gramian Angular Field is quite involved mathematically, so this article will discuss the intuition only, along with the code.

Those interesting in all the gory details are encouraged to read the paper, titled “Encoding Time Series as Images for Visual Inspection and Classification Using Tiled Convolutional Neural Networks” by Zhiguang Wang and Tim Oates.

We’ll build the intuition in a series of steps.

Let us begin by recalling that the dot product or inner product is a measure of similarity between two vectors.

$$\langle a, b\rangle = \lVert a \rVert \lVert b \rVert \cos \theta$$

Where \( \theta \) is the angle between \( a \) and \( b \).

Ignoring the magnitude of the vectors, if the angle between them is small (i.e. close to 0) then the cosine of that angle will be nearly 1. If the angle is perpendicular, the cosine of the angle is 0. If the two vectors are pointing in opposite directions, then the cosine of the angle will be -1.

The Gram Matrix is just the repeated application of the inner product between every vector in a set of vectors, and every other vector in that same set of vectors.

i.e. Suppose that we store a set of column vectors in a matrix called \( X \).

The Gram Matrix is:

$$ G = X^TX $$

This expands to:

$$G = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & … & \langle x_1, x_N \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & … & \langle x_2, x_N \rangle \\ … & … & … & … \\ \langle x_N, x_1 \rangle & \langle x_N, x_2 \rangle & … & \langle x_N, x_N \rangle \end{bmatrix} $$

In other words, if we think of the inner product as the similarity between two vectors, then the Gram Matrix just gives us the pairwise similarity between every vector and every other vector.


Note that the Gramian Angular Field (GAF) does not apply the Gram Matrix directly (in fact, each value of the time series is a scalar, not a vector).

The first step in computing the GAF is to normalize the time series to be in the range [-1, +1].

Let’s assume we are given a time series \( X = \{x_1, x_2, …, x_N \} \).

The normalized values are denoted by \( \tilde{x_i} \).

The second step is to convert each value in the normalized time series into polar coordinates.

We use the following transformation:

$$ \phi_i = \arccos \tilde{x_i}$$

$$ r_i = \frac{t_i}{N} $$

Where \( t_i \in \mathbb{N} \) represents the timestamp of data point \(x _i \).

Finally, the GAF method defines its own “special” inner product as:

$$ \langle x_1, x_2 \rangle = \cos(\phi_1 + \phi_2) $$

From here, the above formula for \( G \) still applies (except using \( \tilde{X} \) instead of \( X \), and using the custom inner product instead of the usual version).

Here is an illustration of the process:

So why use the GAF?

Like the original Gram Matrix, it gives you a “picture” (no pun intended) of the relationship between every point and every other point in the time series.

That is, it displays the temporal correlation structure in the time series.

Here’s how you can use it in code.

Firstly, you need to install the pyts library. Then, run the following code on a time series of your choice:


Note that the library allows you to rescale the image with the image_size argument.

As an exercise, try using this method instead of the 1-D CNNs we used in the course and compare their performance!


Markov Transition Field

The Markov Transition Field (MTF) is another method of converting a time series into an image.

The process is a bit simpler than that of the GAF.

If you have taken any of my courses which involve Markov Models (like Natural Language Processing, or HMMs) you should feel right at home.

Let’s assume we have an N-length time series.

We begin by putting each value in the time series into quantiles (i.e. we “bin” each value).

For example, if we use quartiles (4 bins), the smallest 25% of values would define the boundaries of the first quartile, the second smallest 25% of values would define the boundaries of the second quartile, etc.

We can think of each bin as a ‘state’ (using Markov model terminology).

Intuitively, we know that what we’d like to do when using Markov models is to form the state transition matrix.

This matrix has the values:

$$A_{ij} = P(s_t = j | s_{t-1} = i)$$

That is, \( A_{ij} \) is the probability of transitioning from state i to state j.

As usual, we estimate this value by maximum likelihood. ( \( A_{ij} \) is the count of transitions from i to j, divided by the total number of times we were in state i).

Note that if we have \( Q \) quantiles (i.e. we have \( Q \) “states”), then \( A \) is a \( Q \times Q \) matrix.

The MTF follows a similar concept.

The MTF (denoted by \( M \)) is an \( N \times N \) matrix where:

$$M_{kl} = A_{q_k q_l}$$

And where \( q_k \) is the quantile (“bin”) for \( x_k \), and \( q_l \) is the quantile for \( x_l \).

Note: I haven’t re-used the letters i and j to index \( M \), which most resources do and it’s super confusing.

Do not mix up the indices for \( M \) and \( A \)! The indices in \( A \) refer to states. The indices for \( M \) are temporal.

\( A_{ij} \) is the probability of transitioning from state i to state j.

\( M_{kl} \) is the probability of a one-step transition from the bin for \( x_k \), to the bin for \( x_l \).

That is, it looks at \( x_k \) and \( x_l \), which are 2 points in the time series at arbitrary time steps \( k \) and \( l \).

\( q_k \) and \( q_l \) are the corresponding quantiles.

\( M_{kl} \) is then just the probability that we saw a direct one-step (i.e. Markovian) transition from \( q_k \) to \( q_l \) in the time series.

So why use the MTF?

It shows us how related 2 arbitrary points in the time series are, relative to how often they appear next to each other in the time series.


Here’s how you can use it in code.

Note that the library allows you to rescale the image with the image_size argument.

As an exercise, try using this method instead of the 1-D CNNs we used in the course and compare their performance


Go to comments

Should you study the theory behind machine learning?

August 23, 2021

In this post, I want to discuss why you should not study the theory behind machine learning.

This may surprise some of you, since my courses can appear to be more “theoretical” than other ML courses on popular websites such as Udemy.

However, that is not the kind of “theory” I am talking about.


Most popular courses in ML don’t look at any math at all.

They are popular precisely for this reason: lack of math makes them accessible to the average Joe.

This does a disservice to you students, because you end up not having any solid understanding about how the algorithm works.

You may end up:

  • doing things that don’t make sense, due to that lack of understanding.
  • only being able to copy code from others, but not write any code yourself.
  • not knowing how to apply algorithms to new kinds of data, without someone showing you how first.

For more discussion on that, see my post: “Why do you need math for machine learning and deep learning?

But let’s make this clear: math != theory.


When we look at math in my courses, we only look at the math needed to derive the algorithm and understand how it works at an intuitive level.

Yes, believe it or not, we are using math to improve our intuition.

This is despite what many beginners might think. When they see math, they automatically assume “math” = “not intuitive”, and that “intuitive” = “pictures, animations, and purposely avoiding math”.

That’s OK if you want to read a news article in the NY Times about ML, but not when you want to be a practitioner of ML.

Those are 2 different levels of “intuition” (layman vs. practitioner).


To see an extreme example of this, one need not look any further than Albert Einstein. Einstein was great at communicating his ideas to the public. Everyone can easily understand the layman interpretation of general relativity (mass bends space and time). But this is not the same as being a practitioner of relativistic physics.

Everyone has seen this picture and understands what it means at a high level. But does that mean you are a physicist or that you can “do physics”?

Anyway, that was just an aside so we don’t confuse “math used for intuition” and “layman intuition” and “theory”. These are 3 separate things. Just because you’re looking at some math, does not automatically imply you’re looking at “theory”.



What do we mean by “theory”?

Here’s a simple question to consider. Why does gradient descent work?

Despite the fact that we have used gradient descent in many of my courses, and derived the gradient descent update rules for neural networks, SVMs, and other models, we have never discussed why it works.

And that’s OK!

The “mathematical intuition” is enough.

But let’s get back to the question of this article: Why is the Lazy Programmer saying we should not study theory?


Well, this is the kind of “theory” that gets so deep, it:

  • Does not produce any near-term gains in your work
  • Requires a very high level of math ability (e.g. real analysis, optimization, dynamical systems)
  • Is on the cutting-edge of understanding, and thus very difficult, likely to be disputed or even superseded in the near future


Case in point: although we have been using gradient descent for years in my courses (and decades before that in general), our understanding is still not yet complete.

Here’s an article that just came out this year on gradient descent (August 2021): “Computer Scientists Discover Limits of Major Research Algorithm“.

Here’s a direct link to the corresponding paper, called “The Complexity of Gradient Descent: CLS = PPAD ∩ PLS”:

There will be more papers on these “theory” topics in the years to come.


My advice is not to go down this path, unless you really enjoy it, you are doing graduate research (e.g. PhD-level), you don’t mind if ideas you spent years and years working on might be proven incorrect, and you have a very high level of math ability in subjects like real analysis, optimization, and dynamical systems.

Go to comments

Predicting Stock Prices with Facebook Prophet

August 3, 2021

Prophet is Facebook’s library for time series forecasting. It is mainly geared towards business datasets (e.g. predicting adspend or CPU usage), but a natural question that comes up with my students whenever we talk about time series is: “can it predict stock prices?”

In this article, I will discuss how to use FB Prophet to predict stock prices, and I’ll also show you what not to do (things I’ve seen in other popular blogs). Furthermore, we will benchmark the Prophet model with the naive forecast, to check whether or not one would really want to use this.

Note: This is an excerpt from my full VIP course, “Time Series Analysis, Forecasting, and Machine Learning“. If you want the code for this example, along with many, many other code examples on stock prices, sales data, and smartphone data, get the course!

The Prophet section will be part of the VIP version only, so get it now while the VIP coupon is still active!


How does Prophet work?

The Prophet model is a 3 component, non-autoregressive time series model. Specifically:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon(t)$$


The Prophet model is not autoregressive, like ARIMA, exponential smoothing, and the other methods we study in a typical time series course (including my own).

The 3 components are:

1) The trend \( g(t) \) which can be either linear or logistic.

2) The seasonality \( s(t) \), modeled using a Fourier series.

3) The holiday component \( h(t) \), which is essentially a one-hot vector “dotted” with a vector of weights, each representing the contribution from their respective holiday.


How to use Prophet for predicting stock prices

In my course, we do 3 experiments. Our data is Google’s stock price from approximately 2013-2018, but we only use the first 2 years as training data.

The first experiment is “plug-and-play” into Prophet with the default settings.


Here are the results:

Unfortunately, Prophet mistakenly believes there is a weekly seasonal component, which is the reason for the little “hairs” in the forecast.

When we plot the components of the model, we see that Prophet has somehow managed to find some weekly seasonality.

Of course, this is completely wrong! The model believes that the stock price increases on the weekends, which is highly unlikely because we don’t have any data for the weekend.


The second experiment is an example of what not to do. I saw this in every other popular blog, which is yet another “data point” that should convince you not to trust these popular data science blogs you find online (except for mine, obviously).

In this experiment, we set daily_seasonality to True in the model constructor.


Here are the results.

It seems like those weird little “hairs” coming from the weekly seasonal component have disappeared.

“The Lazy Programmer is wrong!” you may proclaim.

However, this is because you may not understand what daily seasonality really means.

Let’s see what happens when we plot the components.

This plot should make you very suspicious. Pay attention to the final chart.

“Daily seasonality” pertains to a pattern that repeats everyday with sub-daily changes.

This cannot be the case, because our data only has daily granularity!

Lesson: don’t listen to those “popular” blogs.


For experiment 3, we set weekly seasonality to False. Alternatively, you could try playing around with the priors.


Here are the results.

Notice that the “little hairs” are again not present.


Is this model actually good?

Just because you can make a nice chart, does not mean you have done anything useful.

In fact, you see the exact same mistakes in those blog articles and terrible Udemy courses promising to “predict stock prices with LSTMs” (which I will call out every chance I get).

One of the major mistakes I see in nearly every blog post about predicting stock prices is that they don’t bother to compare it to a benchmark. And as you’ll see, the benchmark for stock prices is quite a low bar – there is no reason not to compare.

Your model is only useful if it can beat the benchmark.

For stock price predictions, the benchmark is typically the naive forecast, which is the optimal forecast for a random walk.

Random walks are often used as a model for stock prices since they share some common attributes.

For those unfamiliar, the naive forecast is simply where you predict the last-known value.

Example: If today’s price on July 5 is $200 and I want to make a forecast with a 5-day horizon, then I will predict $200 for July 6, $200 for July 7, …, and $200 for July 10.

I won’t bore you with the code (although it’s included in the course if you’re interested), but the answer is: Prophet does not beat the naive forecast.

In fact, it does not beat the naive forecast on any horizon I tried (5 days, 30 days, 60 days).

Sidenote: it’d be a good exercise to try 1 day as well.


How to learn more

Are stock prices really random walks? Although this particular example provides evidence supporting the random walk hypothesis, in my course, the GARCH section will provide strong evidence against it! Again, it’s all explained in my latest course, “Time Series Analysis, Forecasting, and Machine Learning“. Only the VIP version will contain the sections on Prophet, GARCH, and other important tools.

The VIP version is intended to be limited-time only, and the current coupon expires in less than one month!

Get your copy today while you still can.

Go to comments

Why do you need math for machine learning and deep learning?

July 9, 2021

In this article, I will demonstrate why math is necessary for machine learning, data science, deep learning, and AI.

Most of my students have already heard this from me countless times. College-level math is a prerequisite for nearly all of my courses already.

This article is a bit different.

Perhaps you may believe I am biased, because I’m the one teaching these courses which require all this math.

It would seem that I am just some crazy guy, making things extra hard for you because I like making things difficult.


You’ve heard it from me many times. Now you’ll hear it from others.

This article is a collection of resources where people other than myself explain the importance of math in ML.


Example #1

Let’s begin with one of the most famous professors in ML, Daphne Koller, who co-founded Coursera.

In this clip, Lex Fridman asks what advice she would have for those interested in beginning a journey into AI and machine learning.

One important thing she mentions, which I have seen time and time again in my own experience, is that those without typical prerequisite math backgrounds often make mistakes and do things that don’t make sense.

She’s being nice here, but I’ve met many of these folks who not only have no idea that what they are doing does not make sense, they also tend to be overly confident about it!

Then it becomes a burden for me, because I have to put in more effort explaining the basics to you just to convince you that you are wrong.

For that reason, I generally advise against hiring people for ML roles if they do not know basic math.


Example #2

I enjoyed this strongly worded Reddit comment.

Original post:

Top comment:


Example #3

Not exactly machine learning, but very related field: quant finance.

In fact, many students taking my courses dream about applying ML to finance.

Well, it’s going to be pretty hard if you can’t pass these interview questions.

Think about this logically: All quants who have a job can pass these kinds of interview questions. But you cannot. How well do you think you will do compared to them?


Example #4

Entrepreneur and angel investor Naval Ravikant explains why deriving (what we do in all of my in-depth machine learning courses) is much more important than memorizing on the Joe Rogan Experience.

Most beginner-level Udemy courses don’t derive anything – they just tell you random facts about ML algorithms and then jump straight to the usual 3 lines of scikit-learn code. Useless!

Link: (Skips to 1:33:30 automatically)



Example #5

I found this in a thread about Lambda School (one of the many “developer bootcamps” in existence these days) getting sued for lying about its job placement rates and cutting down on its staff.

Two interesting comments here from people “in the know” about how bootcamps did not really help unless the student already had a math / science / STEM background. The first comment is striking because it is written by a former recruiter (who has the ability to see who does and doesn’t get the job).

That is to say, it is difficult to go from random guy off the street to professional software engineer from just a bootcamp alone (the implication here is that we can apply similar reasoning to online courses).

In this case, it wasn’t even that the math was being directly applied. A math / science background is important because it teaches you how to think properly. If 2 people can complete a bootcamp or online course, but only one has a STEM background and knows how to apply what they learned, that one will get the job, and the other will not.

Importantly, note that it’s not about the credentials, it’s purely about ability, as per the comments below.



Example #6

This is from a thread concerning Yann LeCun’s deep learning course at NYU. As usual, someone makes a comment that you don’t need such courses when you can just plug your data into Tensorflow like everyone else. Another, more experienced developer sets them straight.



Example #7

Hey, you guys have heard of Yann LeCun, right? Remember that guy? The guy who invented CNNs?

Let’s see what he has to say:

Math. Math. Oh and perhaps some more math.

That’s the gist of the advice to students interested in AI from Facebook’s Yann LeCun and Joaquin Quiñonero Candela

 who run the company’s Artificial Intelligence Lab and Applied Machine Learning group respectively.

Tech companies often advocate STEM (science, technology, engineering and math), but today’s tips are particularly pointed. The pair specifically note that students should eat their vegetables take Calc I, Calc II, Calc III, Linear Algebra, Probability and Statistics as early as possible.



This article will be updated over time. Keep checking back!

Go to comments

Time Series: How to convert AR(p) to VAR(1) and VAR(p) to VAR(1)

July 1, 2021

This is a very condensed post, mainly just so I could write down the equations I need for my Time Series Analysis course. 😉

However, it you find it useful – I am happy to hear that!

[Get 75% off the VIP version here]

Start with an AR(2):

$$ y_t = b + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t $$


Suppose we create a vector containing both \( y_t \) and \( y_{t -1} \):

$$\begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix}$$


We can write our AR(2) as follows:

$$\begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \end{bmatrix}$$


Exercise: expand the above to see that you get back the original AR(2). Note that the 2nd line just ends up giving you \( y_{t-1} = y_{t-1} \).

The above is just a VAR(1)!

You can see this by letting:

$$ \textbf{z}_t = \begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix}$$

$$ \textbf{b}’ = \begin{bmatrix} b \\ 0 \end{bmatrix} $$

$$ \boldsymbol{\Phi}’_1 = \begin{bmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{bmatrix} $$

$$ \boldsymbol{\eta}_t = \begin{bmatrix} \varepsilon_t \\ 0 \end{bmatrix}$$.

Then we get:

$$ \textbf{z}_t = \textbf{b}’ + \boldsymbol{\Phi}’_1\textbf{z}_{t-1} + \boldsymbol{\eta}_t$$

Which is a VAR(1).


Now let us try to do the same thing with an AR(3).

$$ y_t = b + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \phi_3 y_{t-3} + \varepsilon_t $$


We can write our AR(3) as follows:

$$\begin{bmatrix} y_t \\ y_{t-1} \\ y_{t-2} \end{bmatrix} = \begin{bmatrix} b \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \\ y_{t-3} \end{bmatrix} + \begin{bmatrix} \varepsilon_t \\ 0 \\ 0 \end{bmatrix}$$

Note that this is also a VAR(1).


Of course, we can just repeat the same pattern for AR(p).


The cool thing is, we can extend this to VAR(p) as well, to show that any VAR(p) can be expressed as a VAR(1).

Suppose we have a VAR(3).

$$ \textbf{y}_t = \textbf{b} + \boldsymbol{\Phi}_1 \textbf{y}_{t-1} + \boldsymbol{\Phi}_2 \textbf{y}_{t-2} + \boldsymbol{\Phi}_3 \textbf{y}_{t-3} + \boldsymbol{ \varepsilon }_t $$


Now suppose that we create a new vector by concatenating \( \textbf{y}_t \), \( \textbf{y}_{t-1} \), and \( \textbf{y}_{t-2} \). We get:

$$\begin{bmatrix} \textbf{y}_t \\ \textbf{y}_{t-1} \\ \textbf{y}_{t-2} \end{bmatrix} = \begin{bmatrix} \textbf{b} \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} \boldsymbol{\Phi}_1 & \boldsymbol{\Phi}_2 & \boldsymbol{\Phi}_3 \\ I & 0 & 0 \\ 0 & I & 0 \end{bmatrix} \begin{bmatrix} \textbf{y}_{t-1} \\ \textbf{y}_{t-2} \\ \textbf{y}_{t-3} \end{bmatrix} + \begin{bmatrix} \boldsymbol{\varepsilon_t} \\ 0 \\ 0 \end{bmatrix}$$

This is a VAR(1)!



Go to comments

Deep Learning and Artificial Intelligence Newsletter

Get discount coupons, free machine learning material, and new course announcements