In this article, we are again going to combine my current favorite subjects: natural language processing, time series analysis, and financial analysis.

Recently, I created a couple of lectures covering Granger causality, so this topic is fresh in my mind.

In short, Granger causality is used to determine whether one time series can be used to forecast another (i.e. predict the future).

In these lectures, I demonstrated that some economic variables are Granger causal (in particular, GDP and the term spread).

Of course, another easy application is to determine whether or not Twitter sentiment can predict cryptocurrency movements.

This post is based on the short publication “Does Twitter Predict Bitcoin?” by Shen, D., Urquhart, A. and Wang, P. (2019), which can be found at https://centaur.reading.ac.uk/80420/1/Twitter.Bitcoin.pdf

The premise is quite simple, and you really just have to understand these 3 components in order to implement this yourself:

1) How to get a Twitter sentiment time series

2) How to get Bitcoin price time series

3) How to implement the Granger causality test

If you can do 1-3, you can predict Bitcoin! (at least, partially)

So let’s go over each of these 3 topics in order.

### How to get a Twitter sentiment time series

This is probably going to be the most difficult part for most students. Most students are used to downloading a CSV dataset, which I typically make very nice and simple for my courses.

Unfortunately, real life is not like this.

This becomes a data engineering problem.

Which tweets by which authors do you choose?

How do you use Twitter’s API to download the tweets?

Where do you store the tweets?

Once you’ve figured that out, you need to convert the tweets into a number (sentiment) such that the numbers collectively form a time series.

That part is not so hard.

I’ve demonstrated several methods of doing this, such as:

a) training your own model on sentiment data (you could even create your own dataset)

b) using a pretrained Transformer model
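As a rough sketch of the "convert the tweets into a number" step, the snippet below averages per-tweet scores into one value per day. The column names `created_at` and `text`, and the word-list scorer `toy_sentiment`, are assumptions made purely for illustration; in practice you would pass in a scorer built with option (a) or (b) above.

```python
import pandas as pd

# Toy word lists: a hypothetical stand-in for a real sentiment model.
POSITIVE = {"bullish", "moon", "buy", "gain", "up"}
NEGATIVE = {"bearish", "crash", "sell", "loss", "down"}


def toy_sentiment(text: str) -> float:
    """Score a tweet in [-1, 1] as (positive - negative) / matched words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def daily_sentiment(tweets: pd.DataFrame, scorer=toy_sentiment) -> pd.Series:
    """Average per-tweet scores into one value per calendar day."""
    scores = tweets["text"].map(scorer)
    scores.index = pd.DatetimeIndex(tweets["created_at"])
    # Days with no tweets come out as NaN; how to fill them is a modelling choice.
    return scores.resample("D").mean().rename("sentiment")


# Made-up tweets, only to show the shape of the input and output.
tweets = pd.DataFrame(
    {
        "created_at": ["2021-01-01 09:00", "2021-01-01 17:30", "2021-01-02 12:00"],
        "text": ["bitcoin to the moon buy now", "bearish crash incoming", "sell sell sell"],
    }
)
sentiment = daily_sentiment(tweets)
print(sentiment)
```

Because the scorer is passed in as a plain function, swapping the toy word lists for a trained model only changes one argument; the aggregation into a time series stays the same.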

### How to get Bitcoin price time series

In contrast to the first task, this is probably the easiest.

In the past, I’ve demonstrated how you can easily get minute, daily, monthly, etc. data for essentially any ticker using the yfinance Python package.
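For reference, here is a minimal sketch of that step. It assumes the `BTC-USD` ticker symbol and a one-year window, and the helper names `fetch_daily_close` and `log_returns` are hypothetical. The download function needs network access, so the example at the bottom uses made-up prices to show the conversion to returns.

```python
import numpy as np
import pandas as pd


def fetch_daily_close(ticker: str = "BTC-USD", period: str = "1y") -> pd.Series:
    """Download daily closing prices (requires network access and yfinance)."""
    import yfinance as yf  # imported here so the rest of the sketch runs offline

    history = yf.Ticker(ticker).history(period=period, interval="1d")
    return history["Close"]


def log_returns(prices: pd.Series) -> pd.Series:
    """Convert a price series to log returns, dropping the undefined first row."""
    return np.log(prices).diff().dropna().rename("btc_return")


# Offline illustration with made-up prices; swap in fetch_daily_close() for real data.
prices = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.date_range("2021-01-01", periods=3, freq="D"),
)
returns = log_returns(prices)
print(returns)
```

Changing the `interval` argument (for example `"1m"` or `"1mo"`) is how you would request minute or monthly bars instead of daily ones.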

### How to implement the Granger causality test

For those of you who haven’t learned Time Series Analysis with me in the past, you perhaps have never heard of Granger causality.

In short, we build a multivariate autoregressive time series model called a VAR model.

It takes the form of:

$$y(t) = \sum_{\tau=1}^L A_\tau y(t-\tau) + \varepsilon(t)$$

Essentially, if you find that any cross-coefficient \( A_\tau(j,i) \) with \( i \neq j \) is “big enough” (in magnitude), then you can conclude that \( y_i(t) \) Granger causes \( y_j(t) \).

As in regression analysis, one decides whether these model coefficients are statistically significant by using hypothesis testing.
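To make this concrete, consider the two-variable case for this post. The notation here is my own shorthand, not taken from the paper: let \( y(t) = (r(t), s(t))^T \), where \( r \) is the BTC return and \( s \) is the Twitter sentiment, and write \( a_\tau = A_\tau(1,1) \) and \( b_\tau = A_\tau(1,2) \). The first row of the VAR model is then:

$$r(t) = \sum_{\tau=1}^L a_\tau r(t-\tau) + \sum_{\tau=1}^L b_\tau s(t-\tau) + \varepsilon_r(t)$$

The null hypothesis is that sentiment does not help at all:

$$H_0: b_1 = b_2 = \dots = b_L = 0$$

If we reject \( H_0 \), we conclude that \( s(t) \) Granger causes \( r(t) \).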

It’s important to note that Granger causality is not “true” causality as one usually thinks of it (e.g. eating food *causes* me to be satiated). Granger causal simply means that one time series is useful in forecasting another (hence the cross-coefficients being non-zero).

Luckily, the Granger causality test is very easy to use in Python with the statsmodels package.

Suppose you have your 2 time series (BTC returns and Twitter sentiment) in a 2-column dataframe (sidenote: your time series should be **stationary** so you should use returns and not prices).

Then you simply call the statsmodels function `grangercausalitytests` (found in `statsmodels.tsa.stattools`).

This will output p-values for every lag order up to the maximum you specify, so you can see whether or not the sentiment up to that lag helps forecast the BTC return.

Final note: unfortunately, the paper only shows that Twitter sentiment Granger causes some function of the squared return. This means we lose information about whether the return is actually going up or down!