# Convert a Time Series Into an Image with Gramian Angular Fields and Markov Transition Fields

August 30, 2021

In my latest course (Time Series Analysis), I made subtle hints in the section on Convolutional Neural Networks that instead of using 1-D convolutions on 1-D time series, it is possible to convert a time series into an image and use 2-D convolutions instead.

CNNs with 2-D convolutions are the “typical” kind of neural network used in deep learning, and they are normally applied to images (e.g. ImageNet, object detection, segmentation, medical imaging and diagnosis, etc.).

In this article, we will look at 2 ways to convert a time series into an image:

1. Gramian Angular Field
2. Markov Transition Field

## Gramian Angular Field

The Gramian Angular Field is quite involved mathematically, so this article will discuss the intuition only, along with the code.

Those interested in all the gory details are encouraged to read the paper, titled “Encoding Time Series as Images for Visual Inspection and Classification Using Tiled Convolutional Neural Networks” by Zhiguang Wang and Tim Oates.

We’ll build the intuition in a series of steps.

Let us begin by recalling that the dot product or inner product is a measure of similarity between two vectors.

$$\langle a, b\rangle = \lVert a \rVert \lVert b \rVert \cos \theta$$

Where $$\theta$$ is the angle between $$a$$ and $$b$$.

Ignoring the magnitude of the vectors, if the angle between them is small (i.e. close to 0) then the cosine of that angle will be nearly 1. If the vectors are perpendicular, the cosine of the angle is 0. If the two vectors are pointing in opposite directions, then the cosine of the angle will be -1.
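To make the three cases concrete, here is a minimal NumPy sketch. The function name `cosine_of_angle` and the example vectors are my own choices for illustration.

```python
import numpy as np

def cosine_of_angle(a, b):
    """Cosine of the angle between a and b: <a, b> / (||a|| ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_of_angle([1, 2], [2, 4])   # parallel vectors
perpendicular = cosine_of_angle([1, 0], [0, 3])    # vectors at a right angle
opposite = cosine_of_angle([1, 2], [-1, -2])       # vectors pointing opposite ways

print(same_direction, perpendicular, opposite)
```

Dividing by the two norms is what “ignoring the magnitude” means in practice: only the angle is left.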

The Gram Matrix is just the repeated application of the inner product between every vector in a set of vectors, and every other vector in that same set of vectors.

That is, suppose that we store a set of column vectors in a matrix called $$X$$.

The Gram Matrix is:

$$G = X^TX$$

This expands to:

$$G = \begin{bmatrix} \langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & … & \langle x_1, x_N \rangle \\ \langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & … & \langle x_2, x_N \rangle \\ … & … & … & … \\ \langle x_N, x_1 \rangle & \langle x_N, x_2 \rangle & … & \langle x_N, x_N \rangle \end{bmatrix}$$

In other words, if we think of the inner product as the similarity between two vectors, then the Gram Matrix just gives us the pairwise similarity between every vector and every other vector.
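As a quick sanity check, here is a small sketch of the Gram Matrix in NumPy. The three vectors are arbitrary values chosen for illustration.

```python
import numpy as np

# Three 2-D vectors stored as the COLUMNS of X (shape: dimension x N).
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 1.0]])

# G[i, j] is the inner product between column i and column j.
G = X.T @ X
print(G)

# The same entry computed by hand, for comparison.
x2, x3 = X[:, 1], X[:, 2]
print(G[1, 2], x2 @ x3)
```

Note that $$G$$ is always symmetric, since the inner product does not care about the order of its arguments.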

Note that the Gramian Angular Field (GAF) does not apply the Gram Matrix directly (in fact, each value of the time series is a scalar, not a vector).

The first step in computing the GAF is to normalize the time series to be in the range [-1, +1].

Let’s assume we are given a time series $$X = \{x_1, x_2, …, x_N \}$$.

The normalized values are denoted by $$\tilde{x_i}$$.
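The exact rescaling is a design choice. A common one (and, as far as I can tell, the one used in the paper) is min-max rescaling:

$$\tilde{x_i} = \frac{(x_i - \max(X)) + (x_i - \min(X))}{\max(X) - \min(X)}$$

This maps the smallest value in the series to -1 and the largest to +1. That matters for the next step, since the inverse cosine is only defined on that range.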

The second step is to convert each value in the normalized time series into polar coordinates.

We use the following transformation:

$$\phi_i = \arccos \tilde{x_i}$$

$$r_i = \frac{t_i}{N}$$

Where $$t_i \in \mathbb{N}$$ represents the timestamp of data point $$x _i$$.

Finally, the GAF method defines its own “special” inner product as:

$$\langle x_1, x_2 \rangle = \cos(\phi_1 + \phi_2)$$

From here, the above formula for $$G$$ still applies (except using $$\tilde{X}$$ instead of $$X$$, and using the custom inner product instead of the usual version).
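The steps so far can be sketched from scratch in NumPy. This is a minimal sketch, not the library implementation: the function name `gramian_angular_field` is hypothetical, and I am assuming min-max rescaling for the normalization step.

```python
import numpy as np

def gramian_angular_field(x):
    """Return the N x N matrix with entries cos(phi_i + phi_j)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled to [-1, 1]")

    # Step 1: rescale to [-1, 1] (assumption: min-max rescaling).
    x_tilde = 2.0 * (x - lo) / (hi - lo) - 1.0
    x_tilde = np.clip(x_tilde, -1.0, 1.0)  # guard against round-off

    # Step 2: angular coordinate. The radius r_i = t_i / N is not needed
    # to build the image itself, so it is not computed here.
    phi = np.arccos(x_tilde)

    # Step 3: the "special" inner product for every pair of time steps.
    return np.cos(phi[:, None] + phi[None, :])

series = np.array([1.0, 2.0, 3.0, 4.0])
gaf = gramian_angular_field(series)
print(gaf.shape)
```

Since $$\cos(\phi_i + \phi_j) = \tilde{x_i}\tilde{x_j} - \sqrt{1 - \tilde{x_i}^2}\sqrt{1 - \tilde{x_j}^2}$$, you could also compute the same matrix without ever calling `arccos`.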

Here is an illustration of the process:

So why use the GAF?

Like the original Gram Matrix, it gives you a “picture” (no pun intended) of the relationship between every point and every other point in the time series.

That is, it displays the temporal correlation structure in the time series.

Here’s how you can use it in code.

Firstly, you need to install the pyts library. Then, run the following code on a time series of your choice:

Note that the library allows you to rescale the image with the image_size argument.

As an exercise, try using this method instead of the 1-D CNNs we used in the course and compare their performance!

## Markov Transition Field

The Markov Transition Field (MTF) is another method of converting a time series into an image.

The process is a bit simpler than that of the GAF.

If you have taken any of my courses which involve Markov Models (like Natural Language Processing, or HMMs) you should feel right at home.

Let’s assume we have an N-length time series.

We begin by putting each value in the time series into quantiles (i.e. we “bin” each value).

For example, if we use quartiles (4 bins), the smallest 25% of values fall into the first bin, the next smallest 25% of values fall into the second bin, etc.
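Here is a minimal sketch of the binning step in NumPy. The helper name `quantile_bins` and the toy series are my own, and this is just one reasonable way to handle the bin edges.

```python
import numpy as np

def quantile_bins(x, n_bins=4):
    """Assign each value to a quantile bin, labelled 0 .. n_bins - 1."""
    x = np.asarray(x, dtype=float)
    # Interior bin edges, e.g. the 25th, 50th and 75th percentiles for 4 bins.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    # np.digitize returns, for each value, the index of the bin it falls in.
    return np.digitize(x, edges)

series = np.array([8, 1, 6, 3, 7, 2, 5, 4])
states = quantile_bins(series, n_bins=4)
print(states)
print(np.bincount(states))  # how many values landed in each bin
```

With quantile bins, each bin ends up with (roughly) the same number of points, which is different from splitting the value range into equal-width intervals.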

We can think of each bin as a ‘state’ (using Markov model terminology).

Intuitively, we know that what we’d like to do when using Markov models is to form the state transition matrix.

This matrix has the values:

$$A_{ij} = P(s_t = j | s_{t-1} = i)$$

That is, $$A_{ij}$$ is the probability of transitioning from state i to state j.

As usual, we estimate this value by maximum likelihood ($$A_{ij}$$ is the count of transitions from i to j, divided by the total number of times we were in state i).

Note that if we have $$Q$$ quantiles (i.e. we have $$Q$$ “states”), then $$A$$ is a $$Q \times Q$$ matrix.
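The counting estimate can be sketched as follows. The function name `transition_matrix` and the toy state sequence are hypothetical. One detail I am assuming: the denominator counts transitions out of state i, so the final time step (which has no successor) does not contribute.

```python
import numpy as np

def transition_matrix(states, n_states):
    """Estimate A[i, j] = P(s_t = j | s_{t-1} = i) by counting."""
    states = np.asarray(states)
    counts = np.zeros((n_states, n_states))
    # Count every one-step transition states[t-1] -> states[t].
    np.add.at(counts, (states[:-1], states[1:]), 1)
    row_totals = counts.sum(axis=1, keepdims=True)
    # Normalize each row; states that are never left keep a row of zeros.
    return np.divide(counts, row_totals,
                     out=np.zeros_like(counts), where=row_totals > 0)

states = [0, 1, 1, 2, 0, 1]
A = transition_matrix(states, n_states=3)
print(A)
```

Every row of $$A$$ that has at least one observed transition sums to 1, as you would expect from a conditional distribution.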

The MTF follows a similar concept.

The MTF (denoted by $$M$$) is an $$N \times N$$ matrix where:

$$M_{kl} = A_{q_k q_l}$$

And where $$q_k$$ is the quantile (“bin”) for $$x_k$$, and $$q_l$$ is the quantile for $$x_l$$.

Note: I have deliberately not re-used the letters i and j to index $$M$$. Most resources do, and it’s super confusing.

Do not mix up the indices for $$M$$ and $$A$$! The indices in $$A$$ refer to states. The indices for $$M$$ are temporal.

$$A_{ij}$$ is the probability of transitioning from state i to state j.

$$M_{kl}$$ is the probability of a one-step transition from the bin for $$x_k$$, to the bin for $$x_l$$.

That is, it looks at $$x_k$$ and $$x_l$$, which are 2 points in the time series at arbitrary time steps $$k$$ and $$l$$.

$$q_k$$ and $$q_l$$ are the corresponding quantiles.

$$M_{kl}$$ is then just the probability that we saw a direct one-step (i.e. Markovian) transition from $$q_k$$ to $$q_l$$ in the time series.
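Putting the pieces together, here is a from-scratch sketch of the MTF in NumPy. It is not the library implementation. The function name `markov_transition_field` is hypothetical, and the binning and counting conventions are the simple ones assumed here (quantile edges via `np.quantile`, rows normalized by outgoing transitions).

```python
import numpy as np

def markov_transition_field(x, n_bins=4):
    """Return (M, A, q): the N x N field, the Q x Q matrix, and the bins."""
    x = np.asarray(x, dtype=float)

    # Bin every value into a quantile "state" q_k in 0 .. n_bins - 1.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    q = np.digitize(x, edges)

    # Estimate the state transition matrix A from one-step transitions.
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (q[:-1], q[1:]), 1)
    totals = counts.sum(axis=1, keepdims=True)
    A = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

    # M[k, l] = A[q_k, q_l]: look up A using the bins of time steps k and l.
    M = A[q[:, None], q[None, :]]
    return M, A, q

series = np.array([8, 1, 6, 3, 7, 2, 5, 4])
M, A, q = markov_transition_field(series, n_bins=4)
print(M.shape, A.shape)
```

Notice the two different sizes: $$A$$ is indexed by states, while $$M$$ is indexed by time, exactly as described above.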

So why use the MTF?

It shows us how related 2 arbitrary points in the time series are, measured by how often values from their respective bins appear next to each other in the time series.

Here’s how you can use it in code.

Note that the library allows you to rescale the image with the image_size argument.

As an exercise, try using this method instead of the 1-D CNNs we used in the course and compare their performance!

Enjoy!

# Should you study the theory behind machine learning?

August 23, 2021

In this post, I want to discuss why you should not study the theory behind machine learning.

This may surprise some of you, since my courses can appear to be more “theoretical” than other ML courses on popular websites such as Udemy.

However, that is not the kind of “theory” I am talking about.

Most popular courses in ML don’t look at any math at all.

They are popular precisely for this reason: lack of math makes them accessible to the average Joe.

This does a disservice to you students, because you end up not having any solid understanding about how the algorithm works.

You may end up:

• doing things that don’t make sense, due to that lack of understanding.
• only being able to copy code from others, but not write any code yourself.
• not knowing how to apply algorithms to new kinds of data, without someone showing you how first.

For more discussion on that, see my post: “Why do you need math for machine learning and deep learning?”

But let’s make this clear: math != theory.

When we look at math in my courses, we only look at the math needed to derive the algorithm and understand how it works at an intuitive level.

Yes, believe it or not, we are using math to improve our intuition.

This is despite what many beginners might think. When they see math, they automatically assume “math” = “not intuitive”, and that “intuitive” = “pictures, animations, and purposely avoiding math”.

That’s OK if you want to read a news article in the NY Times about ML, but not when you want to be a practitioner of ML.

Those are 2 different levels of “intuition” (layman vs. practitioner).

To see an extreme example of this, one need not look any further than Albert Einstein. Einstein was great at communicating his ideas to the public. Everyone can easily understand the layman interpretation of general relativity (mass bends space and time). But this is not the same as being a practitioner of relativistic physics.

Everyone has seen this picture and understands what it means at a high level. But does that mean you are a physicist or that you can “do physics”?

Anyway, that was just an aside so we don’t confuse “math used for intuition” and “layman intuition” and “theory”. These are 3 separate things. Just because you’re looking at some math, does not automatically imply you’re looking at “theory”.

What do we mean by “theory”?

Here’s a simple question to consider. Why does gradient descent work?

Despite the fact that we have used gradient descent in many of my courses, and derived the gradient descent update rules for neural networks, SVMs, and other models, we have never discussed why it works.

And that’s OK!

The “mathematical intuition” is enough.

But let’s get back to the question of this article: Why is the Lazy Programmer saying we should not study theory?

Well, this is the kind of “theory” that gets so deep, it:

• Does not produce any near-term gains in your work
• Requires a very high level of math ability (e.g. real analysis, optimization, dynamical systems)
• Is on the cutting-edge of understanding, and thus very difficult, likely to be disputed or even superseded in the near future

Case in point: although we have been using gradient descent for years in my courses (and decades before that in general), our understanding is still not yet complete.

Here’s an article that just came out this year on gradient descent (August 2021): “Computer Scientists Discover Limits of Major Research Algorithm”.

Here’s a direct link to the corresponding paper, called “The Complexity of Gradient Descent: CLS = PPAD ∩ PLS”: https://arxiv.org/abs/2011.01929

There will be more papers on these “theory” topics in the years to come.

My advice is not to go down this path, unless you really enjoy it, you are doing graduate research (e.g. PhD-level), you don’t mind if ideas you spent years and years working on might be proven incorrect, and you have a very high level of math ability in subjects like real analysis, optimization, and dynamical systems.

# Predicting Stock Prices with Facebook Prophet

August 3, 2021

Prophet is Facebook’s library for time series forecasting. It is mainly geared towards business datasets (e.g. predicting adspend or CPU usage), but a natural question that comes up with my students whenever we talk about time series is: “can it predict stock prices?”

In this article, I will discuss how to use FB Prophet to predict stock prices, and I’ll also show you what not to do (things I’ve seen in other popular blogs). Furthermore, we will benchmark the Prophet model with the naive forecast, to check whether or not one would really want to use this.

Note: This is an excerpt from my full VIP course, “Time Series Analysis, Forecasting, and Machine Learning”. If you want the code for this example, along with many, many other code examples on stock prices, sales data, and smartphone data, get the course!

The Prophet section will be part of the VIP version only, so get it now while the VIP coupon is still active!

## How does Prophet work?

The Prophet model is a 3-component, non-autoregressive time series model. Specifically:

$$y(t) = g(t) + s(t) + h(t) + \varepsilon(t)$$

Unlike ARIMA, exponential smoothing, and the other methods we study in a typical time series course (including my own), the Prophet model is not autoregressive.

The 3 components are:

1) The trend $$g(t)$$, which can be either linear or logistic.

2) The seasonality $$s(t)$$, modeled using a Fourier series.

3) The holiday component $$h(t)$$, which is essentially a one-hot vector “dotted” with a vector of weights, each representing the contribution from their respective holiday.
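To see how the 3 components add up, here is a toy sketch in NumPy. This is not Prophet's code and it does not fit anything. All coefficients, the holiday dates, and the helper name `fourier_features` are made up for illustration, and the single straight line is a simplification of the trend Prophet actually uses.

```python
import numpy as np

def fourier_features(t, period, order):
    """Columns cos(2*pi*n*t/period) and sin(2*pi*n*t/period), n = 1..order."""
    n = np.arange(1, order + 1)
    angles = 2.0 * np.pi * np.outer(t, n) / period
    return np.hstack([np.cos(angles), np.sin(angles)])

t = np.arange(28, dtype=float)  # 28 daily time steps

# 1) Trend g(t): a single straight line (hypothetical slope and offset).
g = 0.5 * t + 10.0

# 2) Seasonality s(t): weekly Fourier features times hypothetical coefficients.
F = fourier_features(t, period=7.0, order=2)
beta = np.array([1.0, 0.3, -0.5, 0.2])
s = F @ beta

# 3) Holidays h(t): one-hot indicators "dotted" with per-holiday weights.
Z = np.zeros((len(t), 2))
Z[5, 0] = 1.0    # hypothetical holiday A falls on day 5
Z[20, 1] = 1.0   # hypothetical holiday B falls on day 20
kappa = np.array([3.0, -2.0])
h = Z @ kappa

y_hat = g + s + h  # the model without the noise term
print(y_hat[:7])
```

Nothing on the right-hand side depends on past values of $$y$$; everything is a function of time alone. That is what “non-autoregressive” means here.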

## How to use Prophet for predicting stock prices

In my course, we do 3 experiments. Our data is Google’s stock price from approximately 2013-2018, but we only use the first 2 years as training data.

The first experiment is “plug-and-play” into Prophet with the default settings.

Here are the results:

Unfortunately, Prophet mistakenly believes there is a weekly seasonal component, which is the reason for the little “hairs” in the forecast.

When we plot the components of the model, we see that Prophet has somehow managed to find some weekly seasonality.

Of course, this is completely wrong! The model believes that the stock price increases on the weekends, which nothing in the data can support, because we don’t have any data for the weekend.

The second experiment is an example of what not to do. I saw this in every other popular blog, which is yet another “data point” that should convince you not to trust these popular data science blogs you find online (except for mine, obviously).

In this experiment, we set daily_seasonality to True in the model constructor.

Here are the results.

It seems like those weird little “hairs” coming from the weekly seasonal component have disappeared.

“The Lazy Programmer is wrong!” you may proclaim.

However, this is because you may not understand what daily seasonality really means.

Let’s see what happens when we plot the components.

This plot should make you very suspicious. Pay attention to the final chart.

“Daily seasonality” pertains to a pattern that repeats every day with sub-daily changes.

This cannot be the case, because our data only has daily granularity!

Lesson: don’t listen to those “popular” blogs.

For experiment 3, we set weekly seasonality to False. Alternatively, you could try playing around with the priors.

Here are the results.

Notice that the “little hairs” are again not present.

## Is this model actually good?

Just because you can make a nice chart, does not mean you have done anything useful.

In fact, you see the exact same mistakes in those blog articles and terrible Udemy courses promising to “predict stock prices with LSTMs” (which I will call out every chance I get).

One of the major mistakes I see in nearly every blog post about predicting stock prices is that they don’t bother to compare it to a benchmark. And as you’ll see, the benchmark for stock prices is quite a low bar – there is no reason not to compare.

Your model is only useful if it can beat the benchmark.

For stock price predictions, the benchmark is typically the naive forecast, which is the optimal forecast for a random walk.

Random walks are often used as a model for stock prices since they share some common attributes.

For those unfamiliar, the naive forecast is simply where you predict the last-known value.

Example: If today’s price on July 5 is $200 and I want to make a forecast with a 5-day horizon, then I will predict $200 for July 6, $200 for July 7, …, and $200 for July 10.
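In code, the naive forecast is about as simple as it gets. Here is a minimal sketch (not the code from the course). The function names and all the prices are made up for illustration, and I am assuming mean absolute error as the metric.

```python
import numpy as np

def naive_forecast(history, horizon):
    """Repeat the last-known value for every step in the horizon."""
    return np.full(horizon, history[-1], dtype=float)

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Made-up prices; the last known price (July 5) is $200.
prices = np.array([196.0, 198.0, 199.0, 200.0])
forecast = naive_forecast(prices, horizon=5)
print(forecast)  # one prediction each for July 6 through July 10

# Made-up "actual" prices, only to show how a benchmark score is computed.
actual = np.array([201.0, 199.0, 202.0, 203.0, 200.0])
benchmark_mae = mean_absolute_error(actual, forecast)
print(benchmark_mae)
```

To benchmark a model, compute the same metric for its forecast over the same horizon and check whether it comes in lower than `benchmark_mae`.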

I won’t bore you with the code (although it’s included in the course if you’re interested), but the answer is: Prophet does not beat the naive forecast.

In fact, it does not beat the naive forecast on any horizon I tried (5 days, 30 days, 60 days).

Sidenote: it’d be a good exercise to try 1 day as well.

Are stock prices really random walks? Although this particular example provides evidence supporting the random walk hypothesis, in my course, the GARCH section will provide strong evidence against it! Again, it’s all explained in my latest course, “Time Series Analysis, Forecasting, and Machine Learning”. Only the VIP version will contain the sections on Prophet, GARCH, and other important tools.

The VIP version is intended to be limited-time only, and the current coupon expires in less than one month!

Get your copy today while you still can.