All Data is the Same

In this brief post, I am going to discuss my motto, “all data is the same”. Many years ago, when I began making courses, I created this motto because I found that many beginner students didn’t understand the purpose or goal of a machine learning course.

 

The origin of this phrase

Beginners often ask questions such as:

  • “How can I apply this algorithm in the ‘real world’?” (after using it on synthetic data)
  • “How can I apply this algorithm to my dataset?”
  • “How can I apply this algorithm for fraud detection?”
  • “How can I apply this algorithm to sentiment classification?”
  • “How can I apply this algorithm to disease prediction?”

The basic answer is: there is no difference (in relation to the code I gave you in the course).

The code to apply some algorithm in any of these cases is exactly the same.

Because the code is the same and requires no change to adapt to different datasets, in the eyes of the machine learning model, “all data is the same”.

“Data” is just a table of numbers.

ML algorithms don’t care if your data comes from biology, finance, ecology, physics, etc. 

A table of numbers is just a table of numbers. The “real world” meaning is irrelevant to the ML algorithm.

If you want to apply an algorithm from the course to your dataset, no change is required.
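To make this concrete, here is a minimal sketch that runs one unchanged function on two unrelated tables of numbers. The function name `fit_and_score`, the choice of a random forest, and the two bundled scikit-learn datasets (breast cancer and wine) are my assumptions for illustration, not code from the course.

```python
from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def fit_and_score(X, y):
    """Train a classifier on a table of numbers and return its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# A medical dataset and a chemistry dataset: different domains, same code.
X_med, y_med = load_breast_cancer(return_X_y=True)
X_wine, y_wine = load_wine(return_X_y=True)

score_med = fit_and_score(X_med, y_med)
score_wine = fit_and_score(X_wine, y_wine)
print(f"breast cancer accuracy: {score_med:.3f}")
print(f"wine accuracy: {score_wine:.3f}")
```

Nothing inside `fit_and_score` knows or cares which domain the numbers came from.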

This question typically arises when beginner students get frustrated about learning the “theory” behind an algorithm.

They get flustered because there’s math and beginners tend to have very poor math skills.

They want to skip the math and go straight to “applying the algorithm to real-world data”.

As a side note: that’s not what it means to ‘learn’ machine learning. To ‘learn’ machine learning is to learn how the models actually work.

If you want to ‘apply ML to data’, this is trivial. It should take no more than 15 minutes to learn how to plug your data into a scikit-learn model with 3 lines of code. You don’t need a 20-hour course for that.

 

Why what you want doesn’t work

So, why can’t I just make a course showing you examples of how to apply ML algorithms to whatever so-called “real world data” you are interested in?

Problem: students are only interested in their own problems, not the problems of other students!

It is an inherently selfish desire and, furthermore, one that can never be satisfied, because no two students are interested in the same “real-world” examples and applications.

You can’t do a biology example, because the finance students would not understand it. You can’t do a finance example, because the biology students would not understand it.

The goal isn’t to learn the finance part. The goal isn’t to learn the biology part. It’s to learn the machine learning part!

Finance and biology are what we call “domain knowledge”. That’s the part you learn by yourself during your finance degree or your biology degree. It is not part of a machine learning course.

The goal, then, is to use simple examples that everyone can understand easily. Especially important are visualizable examples. I often use “Gaussian clouds” in my courses, because they provide geometric intuition for what machine learning is actually doing.

When you realize that all you’re trying to do is separate the purple dots from the red dots, you realize that it’s not magic after all.
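As a rough sketch of what a Gaussian clouds example can look like, the snippet here draws two 2-D clouds and fits a classifier to separate them. The cloud centers, the number of points, and the use of `LogisticRegression` are assumptions I picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200  # points per cloud

# Two 2-D Gaussian clouds with assumed centers (-2, -2) and (2, 2).
purple = rng.normal(loc=(-2.0, -2.0), scale=1.0, size=(n, 2))
red = rng.normal(loc=(2.0, 2.0), scale=1.0, size=(n, 2))

X = np.vstack([purple, red])
y = np.array([0] * n + [1] * n)  # 0 = purple, 1 = red

# The model only ever sees a table of numbers and a column of labels.
model = LogisticRegression()
model.fit(X, y)
accuracy = model.score(X, y)
print(f"training accuracy: {accuracy:.3f}")
```

Plotting `X` colored by `y` gives the familiar picture of two blobs with a line between them.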

A beginner may say: “Yeah, but Gaussian clouds are not REAL DATA!”

This is not correct thinking.

The correct thinking is: it doesn’t matter what the data is. The code would be the same anyway.

The most important fact to realize is that the point of “learning” machine learning isn’t those 3 lines of Scikit-Learn code.

You should be able to do that all by yourself after spending a few minutes reading the documentation.

“Learning” machine learning means learning what goes on inside those 3 lines of Scikit-Learn code, and realizing that it encapsulates perhaps tens or hundreds of lines of code. True understanding and competence would be the ability to implement that code yourself, without needing Scikit-Learn.
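To give a feel for what can sit behind `fit` and `predict`, here is a minimal sketch of binary logistic regression trained with batch gradient descent, written with NumPy only. The class name `TinyLogisticRegression`, the learning rate, and the iteration count are hypothetical choices of mine, and Scikit-Learn's real implementation uses different solvers and adds regularization.

```python
import numpy as np


class TinyLogisticRegression:
    """Binary logistic regression trained with batch gradient descent."""

    def __init__(self, learning_rate=0.1, n_iters=500):
        self.learning_rate = learning_rate
        self.n_iters = n_iters

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.n_iters):
            p = self._sigmoid(X @ self.w + self.b)
            error = p - y  # per-sample gradient of cross-entropy w.r.t. the logit
            self.w -= self.learning_rate * (X.T @ error) / n_samples
            self.b -= self.learning_rate * error.mean()
        return self

    def predict(self, X):
        # Predict class 1 when the estimated probability is at least 0.5.
        return (self._sigmoid(X @ self.w + self.b) >= 0.5).astype(int)

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))


# Toy data: two well-separated 2-D clouds (assumed centers -2 and 2).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = TinyLogisticRegression().fit(X, y)
accuracy = (model.predict(X) == y).mean()
print(f"training accuracy: {accuracy:.3f}")
```

The calling code at the bottom is still the same short pattern; everything above it is the part a course is meant to teach.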

Furthermore, those 3 lines of Scikit-Learn code are the same for any dataset.

You wouldn’t ask, “How can I adapt this algorithm to work on my finance dataset?”

The algorithm doesn’t change just because you are using your special finance dataset or your special biology dataset.

Linear regression is always linear regression, the same linear regression that has existed for more than two hundred years. There’s no such thing as “linear regression for biology” or “linear regression for finance”.

The least-squares solution for linear regression is: \(w = (X^TX)^{-1}X^Ty\) (assuming \(X^TX\) is invertible). This is the case for any dataset.
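Here is a small sketch of that formula in NumPy, applied unchanged to two made-up tables. The function name `fit_linear_regression` and the synthetic "finance" and "biology" data are hypothetical. I also solve the linear system \((X^TX)w = X^Ty\) rather than forming the inverse explicitly, which gives the same \(w\) when \(X^TX\) is invertible.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Solve the normal equations (X^T X) w = X^T y for w."""
    # Solving the linear system is numerically safer than forming the inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)


rng = np.random.default_rng(0)

# Hypothetical "finance" table: 50 rows, 3 numeric columns, noise-free targets.
X_finance = rng.normal(size=(50, 3))
w_finance_true = np.array([1.5, -2.0, 0.5])
y_finance = X_finance @ w_finance_true

# Hypothetical "biology" table: 80 rows, 2 numeric columns, noise-free targets.
X_biology = rng.normal(size=(80, 2))
w_biology_true = np.array([0.3, 4.0])
y_biology = X_biology @ w_biology_true

# The same function handles both tables.
w_finance = fit_linear_regression(X_finance, y_finance)
w_biology = fit_linear_regression(X_biology, y_biology)
print(w_finance)
print(w_biology)
```

There is no bias term here, matching the formula as written; appending a column of ones to `X` would add one.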

That’s why “all data is the same”.


Further notes:

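If it takes you 20+ minutes to understand these 3 lines of code (plus an import):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X, Y)       # X: table of numbers, Y: targets
model.predict(X)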

then something is wrong! Aim to reach a higher level of understanding.

 

Attempted Rebuttals

Very infrequently, a student will argue against this motto, whether out of misinterpretation, misunderstanding, or simply an inability to think abstractly. It is unfortunate that people are stubbornly against learning new things – after all, isn’t that why you’re taking a course?

Here are some rebuttals I’ve come across:

Rebuttal: The API for ML models only appears to be the same because they’re built that way! For example, SVMs and Decision Trees operate in a completely different manner!

Reply: This misconception arises from conflating “interface” with “implementation”. Any decent programmer should be able to distinguish between the two.

Regardless, it doesn’t matter, because in the end the interface (which is what the “I” in “API” stands for) is in fact the same. Therefore, all the practitioner has to worry about is that the data fits the interface.
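A short sketch of the interface point: an SVM and a decision tree work very differently internally, yet the calling code in the loop is identical for both. The synthetic dataset from `make_classification` and the `scores` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# A synthetic table of numbers: 200 rows, 5 columns, binary labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Two very different implementations behind one shared interface.
models = [SVC(), DecisionTreeClassifier(random_state=0)]
scores = {}
for model in models:
    model.fit(X, y)
    predictions = model.predict(X)
    scores[type(model).__name__] = (predictions == y).mean()

print(scores)
```

Swapping in another Scikit-Learn classifier only means changing the list of models, not the loop.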

 

Rebuttal: But what about models like LSTMs? Certainly, the interface for an LSTM is different from the interface for a Random Forest.

Reply: This misconception arises from overgeneralizing the concept “all data is the same”. This motto doesn’t imply that tabular data is the same as sequence data, or that sequence data is the same as image data, etc.

Obviously that is not the case, so one has to apply nuance (more likely, just common sense).

The motto says that an ML model can and does treat the tabular data from one dataset exactly the same as the tabular data from a different dataset.

An LSTM will treat the sequence data from one dataset (e.g. English text) exactly the same as the sequence data from a different dataset (e.g. DNA sequences).
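To illustrate why, here is a minimal sketch of the step that happens before a sequence model sees anything: symbols become integer IDs. The `encode` helper, the character-level vocabulary, and the toy sequences are hypothetical, and the LSTM itself is left out. Real pipelines typically use a tokenizer and padding, but the end result is the same kind of integer array.

```python
import numpy as np


def encode(sequences):
    """Map each distinct symbol to an integer ID and return a 2-D integer array."""
    symbols = sorted(set("".join(sequences)))
    vocab = {symbol: i for i, symbol in enumerate(symbols)}
    return np.array([[vocab[s] for s in seq] for seq in sequences])


# Equal-length toy sequences, to avoid padding in this sketch.
english = ["the cat", "a dog!!"]
dna = ["ACGTACG", "TTGACCA"]

english_ids = encode(english)
dna_ids = encode(dna)

# Both are now integer arrays of shape (number of sequences, sequence length).
print(english_ids.shape, english_ids.dtype)
print(dna_ids.shape, dna_ids.dtype)
```

Once encoded, nothing in the arrays says which one used to be English and which one used to be DNA.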