Beginner’s guide to debugging (programming, data science, machine learning)

In this post, I will discuss some common techniques I use for debugging that have helped me immensely in my years as a data scientist and software engineer.

Aside from those techniques, I will also discuss some very common errors made by beginners when trying to get the course code to work. Note: it should not require much effort. Simply running the script / notebook should suffice. But as you’ll see, there are some “corner cases”.

Context for this post: this post is written with the students of my courses in mind. Often, they are trying to either run my code (which they got from Github or Google Colab), or they tried to type out my code themselves. This post is designed for students who have problems with this process (i.e. they ran what they thought was the correct code, but it produced an incorrect or unexpected output).

But, back to my debugging techniques.

With these techniques, there has essentially never been a bug I have not been able to solve.

At the same time, these techniques are somewhat “high-level”. They describe a mindset and approach rather than specific things to try.

I can’t tell you specifics because every situation is different. And in any case, if that’s what you’re looking for, that’s the first sign that your mindset is incorrect.

This brings us to…


Why am I making this post?

Normally, I’d just make a post on some data science technique – algorithms, code, or some combination thereof.

But in my courses, too often I see students struggle with the most basic of debugging tasks.

As if they had never written code in their lives.

I see students getting stuck with things that people who practice coding should not be getting stuck with.


The right mindset

Students often come to class with the wrong mindset.

This wrong mindset is: “I click the play button, I watch the code go. Easy!”

If only real life were that easy.

There is, of course, some expectation that when you join a course, the code will work. This is obvious. However, what you will see is that many issues are caused by the students themselves. It is very difficult to make things foolproof. My claim is that you should not want the code to be foolproof. That’s equivalent to calling yourself a fool! Things like changing my code and then forgetting that you changed it, while insisting that it’s totally “the same” as my code… I have no control when you do things like that. Sometimes, you must take responsibility for your own mistakes.

Obviously, you should always use the Q&A, whether the issue was caused by me (e.g. I made a typo) or you. I will test the official version of the code and let you know whether or not it’s still working. If there is no issue on my end, then you must assume it’s on yours.

The right mindset is:

1) “I believe I can solve this problem by investigating every potential source of error”

Most people just give up when they see an error. That’s definitely not how you debug. In the real world, you can’t ask other people for help. If you always need others to solve problems for you, your manager / team lead will find a more self-reliant employee.

2) “It’s not everything around me that’s wrong, it’s me that’s wrong”

Most people cannot accept that their thinking is wrong. But if their thinking were perfect, then clearly the code would run without issue.

This is cognitive dissonance. There’s a contradiction between “My thinking is right” and “The program I think is right doesn’t work”.

Human brains can’t handle contradictions like this. Therefore, you have to accept that one of these assumptions is wrong.

Normally, it’s not the world around you that’s wrong, it’s you that’s wrong.

If you presume that you’re not wrong, then you’ll never discover the problem, because the problem is due to one of your incorrect assumptions.

3) “Never give up”

Most beginner students simply give up. “I’ve tried everything possible”, they proclaim.

This is false.

Again, this goes back to your mental contradictions.

If you tried everything possible, you would have tried something that would uncover the source of the error.

What you really mean is: you lack the creativity to come up with more things to try.

Let’s be clear: this is your fault.

Take responsibility.

Once you take responsibility, then you make the right next steps to improve.

You already know it’s possible to solve the error.

By definition, there is a right way (at least one) and many wrong ways to write the computer program. Currently, you have one of the wrong ways.

But we must assume that a “right way” exists. Otherwise, why are you wasting your time?

Therefore, all that needs to be done is to work harder and dig deeper, until you discover the problem.

Never give up. It is as simple as that.


Common Errors Made By Beginners

Now that we’ve discussed the general high-level mindset for debugging, let’s move on to common errors made by beginners.


Warning: Some readers may find this offensive

The mistakes I’m about to point out are so simple, so against common sense, that you may even be offended that I would suggest you could be susceptible to such mistakes.

Please be aware: these are real mistakes that I’ve seen time and time again.

People promise me 2, 3, even 10 times that they did not make these mistakes.

After arguing back and forth for days, finally they decide to check for themselves and they go, “oh yeah, I guess I really did make that mistake”.

Don’t be one of these people.

There’s a reason these are called common mistakes.

There’s no reason to be offended.


Common Error: You changed the code

This is a mistake students find quite offensive.

“I did not modify your code!” they promise.

I usually have to ask more than once. “Are you sure?” “Are you really really sure?”

It’s so common I usually just copy and paste the same response:

“The #1 source of student error is copying code incorrectly. Therefore, the first thing one should always do is cross-reference with the working code.”

This is not being “lazy”. This is being efficient, because 99% of the time, this is the solution.

After nagging and nagging and nagging, they finally admit:

  • “Oh yeah I wrote my own code to download the data file”
  • “Oh yeah I changed the URL of the data file”
  • “Oh yeah I didn’t bother to run your notebook to confirm everything works as expected”
  • “Oh yeah, I accidentally changed the indentation”

LESSON: Don’t be offended, just accept that my advice is based on real, practical experience.

Run my code. If mine works, and yours doesn’t, then they are not the same, by definition.

It should take no further convincing.


Common Error: Not cross-referencing with the working code

Recall, my famous copy-paste response:

“The #1 source of student error is copying code incorrectly. Therefore, the first thing one should always do is cross-reference with the working code.”

The key is that you must cross-reference with my code.

This is obviously the best and most efficient way to figure out what you did wrong – not to ask me to fix your code for you.

I’ve already given you the right answer! Look at it!

If you have 2 pieces of code, Code A and Code B, and they don’t do the same thing – you should be able to discern the difference. This has nothing to do with coding ability or experience level. You don’t have to be a master, or even a novice. It’s just a test of your attention to detail.

It’s like showing 2 slightly different paintings to a child and asking them to spot the difference.
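If spotting the difference by eye is hard, you can let the computer do it. Here is a minimal sketch using Python's standard difflib module. The two snippets are made-up stand-ins (one for the working course code, one for a retyped copy), not code from any actual course.

```python
import difflib

# Hypothetical snippets: "working" stands in for the course code,
# "mine" for a retyped copy with one accidental change.
working = """total = 0
for x in values:
    total += x
print(total)
"""

mine = """total = 0
for x in values:
    total = x
print(total)
"""


def show_differences(reference, candidate):
    """Return the unified-diff lines between two pieces of source text."""
    return list(difflib.unified_diff(
        reference.splitlines(),
        candidate.splitlines(),
        fromfile="working_code",
        tofile="my_code",
        lineterm="",
    ))


diff_lines = show_differences(working, mine)
print("\n".join(diff_lines))
```

Lines starting with "-" exist only in the working code, and lines starting with "+" exist only in yours. An empty diff means the two texts are identical.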

If you have poor attention to detail / poor work habits, then of course, you will not be so good at this.

Your attitude must change.


Common Error: You don’t have the right data file

Believe it or not, people don’t even bother to check the data they’ve downloaded.

Is it the same as what’s in the course?

Is it named *.csv, but not actually a CSV?

People download the wrong things for all sorts of reasons. It’s best to just save yourself the embarrassment and double check.
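Double checking can be as simple as reading the first few lines of the file. The sketch below is one possible way to do it. The helper names (peek, looks_like_html) and the sample files are invented for illustration, and the HTML check is only a rough heuristic.

```python
import os
import tempfile


def peek(path, n_lines=5):
    """Return the first n_lines of a text file so you can eyeball them."""
    lines = []
    with open(path, encoding="utf8", errors="replace") as f:
        for _ in range(n_lines):
            line = f.readline()
            if not line:
                break
            lines.append(line.rstrip("\n"))
    return lines


def looks_like_html(path):
    """Rough heuristic: does the file start with an HTML doctype or tag?"""
    head = "\n".join(peek(path, n_lines=3)).lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# Demo on two throwaway files that both carry a .csv extension.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    bad = os.path.join(tmp, "bad.csv")
    with open(good, "w", encoding="utf8") as f:
        f.write("name,score\nalice,90\nbob,85\n")
    with open(bad, "w", encoding="utf8") as f:
        f.write("<!DOCTYPE html>\n<html><body>...</body></html>\n")
    good_head = peek(good)
    good_is_html = looks_like_html(good)
    bad_is_html = looks_like_html(bad)

print(good_head)
print("good.csv looks like HTML:", good_is_html)
print("bad.csv looks like HTML:", bad_is_html)
```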

How crazy is it that people don’t even bother to look at the data?

How can one call themselves a “data scientist” if they don’t even take this first step?

This brings us to the next issue…


Common Error: You didn’t use Github correctly

People try to download files from Github by using the URL of the web page that displays the file.

This is incorrect.

This is not how to download files from Git or Github.

Please use Git correctly, or learn how to obtain the correct URL to download the file correctly.

If you have an HTML file instead of a CSV file (even though it may still have a CSV extension), this is you.

You’ve downloaded the webpage that displays the file, not the file itself.

But again, this can be solved simply by not making the previous common error, which is failing to look at the files you’ve downloaded.
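As for obtaining the correct URL: for a file in a public repository, the page that displays the file usually has a URL containing "/blob/", while the raw file is served from raw.githubusercontent.com (the same address the "Raw" button on the page leads to). The sketch below converts one form into the other. The function name to_raw_url and the example URL are hypothetical placeholders, and it assumes the standard URL layout.

```python
def to_raw_url(page_url):
    """Convert a GitHub file-page ('blob') URL into the matching raw-file URL."""
    prefix = "https://github.com/"
    if not page_url.startswith(prefix) or "/blob/" not in page_url:
        raise ValueError("not a GitHub file-page URL: " + page_url)
    # Split "user/repo/blob/branch/path" into "user/repo" and "branch/path".
    user_repo, branch_and_path = page_url[len(prefix):].split("/blob/", 1)
    return "https://raw.githubusercontent.com/" + user_repo + "/" + branch_and_path


# Hypothetical placeholder URL, not a real repository.
page = "https://github.com/some-user/some-repo/blob/master/data/file.csv"
raw = to_raw_url(page)
print(raw)
```

Whichever way you obtain the file, still open it afterwards and confirm it contains what you expect.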


Common Error: You have the wrong version of some library

Often, students sign up for courses I made about 5 years ago.

Of course, this means you may have to use library versions from 5 years ago to match what was made at that time.

Sometimes, the API for those libraries hasn’t changed enough to break the old code, but be aware of this.
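To be aware of it, you first need to know which versions you actually have. A minimal sketch, assuming Python 3.8 or later (for importlib.metadata). The package names in the list are only examples, so substitute whatever the course uses.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


print("python", sys.version.split()[0])
# Example package names only; edit this list to match your course.
for name in ["numpy", "pandas", "scikit-learn"]:
    print(name, installed_version(name))
```

Compare the printed versions against whatever the course materials state, and include them whenever you ask for help.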


Common Error: Mistaking a warning for an error or exception

Please note that many of the libraries we use in data science and machine learning are in their infancy.

Everything is always changing (sometimes on a week to week or month to month basis).

Simply using the latest version of everything will not suffice.

Sometimes, one version of one library will not get along with another version of another library.

That’s just the way it is.

You should contact the author of those libraries yourself if you want to tell them how you feel.

So sometimes warnings pop up for whatever reason.

But warnings are not the same as errors. When an error or exception occurs and nothing catches it, your program will terminate and the error will be printed out.

When a warning occurs, most of the time it is innocuous.
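The difference is easy to see in a small experiment. In this sketch (the function names and messages are made up), the function that warns still returns its result, while the function that raises never finishes.

```python
import warnings


def step_with_warning():
    # A warning is reported, but execution continues to the return statement.
    warnings.warn("this option will be removed in a future version", FutureWarning)
    return "finished"


def step_with_error():
    # An exception stops execution at this line.
    raise ValueError("something is actually broken")


# Record warnings instead of printing them, so we can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = step_with_warning()

print("result after warning:", result)
print("warnings seen:", [str(w.message) for w in caught])

try:
    step_with_error()
    reached_end = True
except ValueError as e:
    reached_end = False
    print("error stopped the function:", e)
```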


Common Error: File Handling on Windows

I sometimes (but not often) write code that is meant to run on UNIX-like systems (Linux, Mac) but is not compatible with Windows. Most of the time, this has to do with opening text files.

Windows is notorious for its poor text handling and compatibility with Linux.

Generally speaking, for my courses this is easily solved by using:

open(filename, encoding='utf8') instead of open(filename)

And this goes for both reading and writing files.
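Why this helps: when no encoding is given, Python's open() falls back to a platform-dependent default, which on Windows is often not UTF-8. Below is a minimal round-trip sketch that passes the encoding explicitly for both writing and reading. The file name and sample text are arbitrary.

```python
import os
import tempfile

# Sample text with non-ASCII characters, which is where defaults tend to differ.
text = "naïve café, 東京"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    # Pass the encoding explicitly when writing...
    with open(path, "w", encoding="utf8") as f:
        f.write(text)
    # ...and again when reading.
    with open(path, encoding="utf8") as f:
        read_back = f.read()

print("round trip preserved the text:", read_back == text)
```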


Strategy: Find the first error

One of the saddest things I see is when students see an error, and only focus on the line that resulted in that error.

This is wrong, and it hints at a very poor understanding of how computer programs work.

The real problem could be one line back, 2 lines back, maybe all the way at the beginning of the script.

This is sort of in line with my other famous rule, “machine learning is experimentation, not philosophy”.

The problem is, students try to use philosophy to determine the source of the error.

They try to “think their way out of the problem”.

This is the wrong strategy.

It is always highly suboptimal to mentally run a program (i.e. think about what a program must be doing using only your mind), vs. actually running computer code.

This should be your last resort when you can think of nothing else.

Trying to run a computer program in your mind to essentially guess what it is doing is an awful approach. Don’t do it.

Instead, learn how to use actual code to trace back to the first instance of something going wrong.

Compare the value of some variable between what it actually is and what it should be.

The key is to find the first instance of where what you expect is not the same as what is there, thereby solving the problem.
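Here is a small, made-up illustration of that idea. The crash happens inside double, but the first place where reality differs from expectation is the output of parse, one step earlier. All function names and data are hypothetical.

```python
def parse(values):
    # Deliberate bug for illustration: unparseable entries silently become None.
    return [int(v) if v.isdigit() else None for v in values]


def double(values):
    # The traceback points here, because None * 2 raises TypeError.
    return [v * 2 for v in values]


def first_bad_index(values):
    """Return the index of the first entry that is not an int, or None."""
    for i, v in enumerate(values):
        if not isinstance(v, int):
            return i
    return None


raw = ["3", "4", "five", "6"]
parsed = parse(raw)

try:
    doubled = double(parsed)
    crashed_in_double = False
except TypeError:
    crashed_in_double = True

# Instead of staring at double(), inspect the value it was given.
bad = first_bad_index(parsed)
print("parsed:", parsed)
print("first bad index:", bad, "raw value:", repr(raw[bad]))
```

The fix belongs in parse (or in the data), not on the line where the exception was raised.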


Strategy: When in doubt, print it out

This is related to the above.

Students often don’t know how to employ the above strategy because they don’t think of checking variable values.

This is so easy.

If you want to know the value of a variable, just print it out!
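Printing the type and length along with the value often reveals the problem immediately. A minimal sketch follows. The helper describe is a made-up convenience, and plain print(type(x), x) works just as well.

```python
def describe(name, value):
    """Build a one-line summary: name, type, length (if any), and repr."""
    parts = [name, "type=" + type(value).__name__]
    if hasattr(value, "__len__"):
        parts.append("len=" + str(len(value)))
    parts.append("value=" + repr(value))
    return " | ".join(parts)


# Hypothetical situation: the variable looks like a number but is a string.
score = "90"
print(describe("score", score))
print(describe("score * 2", score * 2))
```

Here the printout shows that score is a str, which explains why doubling it produces '9090' rather than 180.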