TL;DR: this is an article about how to predict stocks using the news.
In this article, we are going to do an exercise involving my 2 current favorite subjects: natural language processing and financial engineering!
I’ll present this as an exercise / tutorial, so hopefully you can follow along on your own.
One comment I frequently make about predicting stocks is that autoregressive time series models aren’t really a great idea.
Basic analysis (e.g. ACF, PACF) shows no serial correlation in returns (that is, past returns don’t correlate with future returns), and hence the future is not predictable from the past.
The best-fitting ARIMA model is, more often than not, a simple random walk.
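If you want to check the no-serial-correlation claim yourself, here’s a quick sanity check — a minimal sketch using simulated i.i.d. returns as a stand-in for real ones:

```python
import numpy as np

# Simulated i.i.d. daily "returns" -- a stand-in for real stock returns
rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, size=1000)

# Lag-1 autocorrelation: correlation between each return and the next one.
# For real equity returns this is typically indistinguishable from zero.
lag1 = np.corrcoef(returns[:-1], returns[1:])[0, 1]
print(round(lag1, 4))  # close to 0
```

Swap in real log returns and you’ll see essentially the same thing at every lag.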
What is a random walk? If you haven’t yet learned this from me, then basically think of it like flipping a coin at each time step. The result of the coin flip tells you which way to walk: up the street or down the street.
Just as you can’t predict the result of a coin flip from past coin flips (believing otherwise is essentially the gambler’s fallacy!), so too is it impossible to predict the next step of a random walk.
In these situations, the best prediction is simply the last-known value.
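Here’s a quick simulation to convince yourself — a minimal sketch of the coin-flip walk and the “last-known value” forecast:

```python
import numpy as np

rng = np.random.default_rng(42)
steps = rng.choice([-1, 1], size=1000)  # the coin flips: down or up
walk = np.cumsum(steps)                 # the random walk itself

# Naive forecast: predict the next value to be the last-known value
predictions = walk[:-1]
actuals = walk[1:]
mae = np.mean(np.abs(actuals - predictions))
print(mae)  # every step is +/-1, so the naive forecast is off by exactly 1.0
```

No model can beat this on average, because the next step carries no information from the past.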
This is why, when one tries to fit an LSTM to a stock price time series, all it ends up doing is predicting close to the previous value.
There is a nice quote which is unfortunately (as far as I know) unattributed, that says something like: “trying to predict the future from the past is like trying to drive by looking through the rearview mirror”.
Anyway, this brings us to the question: “If I don’t use past prices, then what do I use?”
One common approach is to use the news.
We’ve all seen that news and notable events can have an impact on stock / cryptocurrency prices. Examples:
- The Omicron variant of COVID-19
- High inflation
- Supply-chain issues
- Elon Musk tweeting about Dogecoin
- Mark Zuckerberg being grilled by the government
Luckily, I’m not going to make you scrape the web to download news yourself.
Instead, we’re going to use a pre-built dataset, which you can get at: https://www.kaggle.com/aaron7sun/stocknews
Briefly, you’ll want to look at the “combined” CSV file which has the following columns:
- Date (e.g. 2008-08-11 – daily data)
- Label (whether the DJIA went up (1) or down (0))
- Top1, Top2, …, Top25 (news in the form of text, retrieved from the top 25 Reddit news posts)
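To get a feel for the layout, here’s a toy frame shaped like the real CSV (which you’d load with something like pd.read_csv('Combined_News_DJIA.csv') — check the exact filename in your download):

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV; the real file has 27 columns:
# Date, Label, Top1..Top25
df = pd.DataFrame({
    'Date': ['2008-08-08', '2008-08-11'],
    'Label': [0, 1],
    'Top1': ['b"Georgia downs two Russian warplanes"', 'b"Some headline"'],
    'Top2': ['b"Another headline"', 'b"Yet another"'],
})
print(df.shape)  # (2, 4)
```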
Note that this is a binary classification problem.
Thanks to my famous rule, “all data is the same”, your code should be no different from a simple sentiment analysis / spam detection script.
To start you off, I’ll present some basic starter code / tips.
Tip 1) Some text contains weird formatting, e.g.
b”Georgia ‘downs two Russian warplanes’ as cou…
Basically, it looks like the printed representation of a Python bytes object, but the “b” and the quotes are part of the actual string.
Here’s a simple way to remove unwanted characters:
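For instance, a small helper like this (my own sketch — any equivalent string cleanup works):

```python
import re

def clean_text(s):
    # Drop a literal leading b' or b" left over from printed bytes objects
    s = re.sub(r'^b["\']', '', s)
    # Drop a matching trailing quote, if any
    s = re.sub(r'["\']$', '', s)
    return s

print(clean_text('b"Georgia \'downs two Russian warplanes\' as cou'))
# Georgia 'downs two Russian warplanes' as cou
```

You’d apply this to every one of the 25 news columns.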
Tip 2) Don’t forget that this is time-ordered data, so you don’t want to do a train-test split with shuffling (mixing future and past in the train and test sets). The train set should only contain data that comes before the test set.
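A minimal sketch of the chronological split (using a toy frame; assume the rows are already sorted by Date):

```python
import pandas as pd

# Toy frame standing in for the dataset; rows sorted by Date
df = pd.DataFrame({'Date': pd.date_range('2008-08-08', periods=10),
                   'Label': [0, 1] * 5})

n_train = int(0.8 * len(df))   # first 80% of rows -> train, rest -> test
df_train, df_test = df.iloc[:n_train], df.iloc[n_train:]

# Every training date precedes every test date
assert df_train['Date'].max() < df_test['Date'].min()
```

With scikit-learn, train_test_split(..., shuffle=False) accomplishes the same thing.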
Tip 3) A simple way to form feature vectors from the news would be to just concatenate all 25 news columns into a single text, and then apply TF-IDF.
I’ll leave the concatenation part as an exercise for you.
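Assuming you’ve already built the combined text for each day (the exercise above), the TF-IDF step might look like this (max_features is just an illustrative cap):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assume each entry is all 25 headlines for one day, joined into one string
combined_text = [
    "georgia downs two russian warplanes markets fall",
    "oil prices surge dow rallies on fed news",
]

vectorizer = TfidfVectorizer(max_features=2000)  # cap the vocabulary size
X = vectorizer.fit_transform(combined_text)      # sparse (n_days, n_terms)
print(X.shape)
```

Remember to fit the vectorizer on the train set only, then transform the test set with it.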
Here are some extra thoughts to consider:
- How were the labels created? Does that method make sense? Is it based on close-to-close or open-to-close returns?
- What were the exact times that the news was posted? Was there sufficient time between the latest news post and the result from which the label is computed?
- Returns tend to be very noisy. If you’re getting something like 85% test accuracy, you should be very suspicious that you’ve done something wrong. A more realistic result would be around 50-60%. Even 60% would be considered suspiciously high.
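As a sanity check on that last point: fit a simple classifier on pure noise and you’ll see accuracy hover near chance. A sketch (synthetic features and labels, not the real dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random features and random 0/1 labels: there is nothing to learn here,
# so test accuracy should land near 50%, just like very noisy returns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000).fit(X[:160], y[:160])
acc = model.score(X[160:], y[160:])
print(acc)  # nowhere near 85%
```

If your real pipeline scores far above this on held-out future data, check for lookahead bias before celebrating.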
So that’s basically the exercise. It is simple, yet hopefully thought-provoking.
Now I didn’t know where else to put this ML news I found recently, but I enjoyed it so I want to share it with you all.
First up: “Chatbots: Still Dumb After All These Years”
I enjoyed this article because I get a lot of requests to cover chatbots.
Unfortunately, chatbot technology isn’t very good.
Previously, we used seq2seq (and also seq2seq with attention) which basically just learns to copy canned responses to various inputs. seq2seq means “sequence to sequence” so the input is a sequence (a prompt) and the target/output is a sequence (the chatbot’s response).
Even with Transformers, the best results are still lacking.
Next: “PyTorch vs TensorFlow in 2022”
Wait, people are still talking about this in 2022? You betcha!
Read this article. It says a lot of the same stuff I’ve been saying myself. But it’s nice to hear it from someone else.
It also provides actual metrics which I am too lazy to do.
This isn’t really “new news” (in fact, Facebook isn’t even called Facebook anymore), but I recently came across this old article I saved many years ago.
Probably the most common beginner question I get is “why do I need to do all this math?” (in my ML courses).
You’ve heard the arguments from me hundreds of times.
Perhaps you are hesitant to listen to me. That would be like listening to your parents. Yuck.
Instead, why not listen to Yann LeCun? Remember that guy? The guy who invented CNNs?
He’s the Chief AI Scientist at Facebook (Meta) now, so if you want a job there, you should probably listen to his advice…
And if you think Google, Netflix, Amazon, Microsoft, etc. are any different, well, that is wishful thinking my friends.
What do you think?
Is this convincing? Or is Yann LeCun just as wrong as I am?
Let me know!