What’s with all those single-letter variable names in machine learning code?

December 27, 2018

In this article, I want to discuss a common beginner question, which is:

“What’s with all those single-letter variable names in machine learning code?”

This article is part of a series I started on Common Beginner Questions, so check that out if you’d like to see more.


The short answer

The short answer to this question has several parts. To summarize:

1) We follow conventions. When you’re on a team, you follow your team’s conventions. In ML, single-letter names that mirror the math happen to be the convention.

2) It directly follows the math. If you have a math equation like \( w^T x+b \), then a direct translation into NumPy would look like “w @ x + b”, where the code variables match the names of the math variables. It is easy to follow. It is not easy to follow when you rename everything to look like “weights @ inputs + bias”, because the reader now has to map each name back to a symbol in the equation. This makes things harder, not easier.

3) Don’t use analogies that were meant for learning purposes in the lectures. For example, don’t call something a “neuron” in your code. Don’t call something a “slot machine” or “animal type” or “digit score”. Again, this makes things more confusing.
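
To make point 2 concrete, here is a minimal sketch of the same computation written both ways. The vectors and the bias value are made up for illustration; only the naming differs between the two versions.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
w = np.array([0.5, -1.0, 2.0])   # weight vector
x = np.array([1.0, 2.0, 3.0])    # input vector
b = 0.25                         # bias term

# Direct translation of the equation w^T x + b
y = w @ x + b

# The same computation with renamed variables
weights, inputs, bias = w, x, b
output = weights @ inputs + bias

print(y, output)  # both lines compute the same number
```

Both versions do identical work. The first can be checked against the equation symbol by symbol, which is the whole argument for keeping the math names.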


EXAMPLES (from the “real world”):

From Keras’ own documentation

From Theano’s documentation

From JAX’s documentation

From TensorFlow source code

But hey, if you think you have superior skills relative to Google engineers because you don’t use single-letter variable names, then you are welcome to think that.


Broad ideas about beginners vs. professionals

Here’s a common pattern I see among software engineers and programmers. There are 2 types:

Type 1) Beginners / students who just graduated a bootcamp / students who just graduated college (for brevity, I’ll simply call these “beginners” in this article)

Type 2) Seasoned professionals / those who have experience working on large teams, working on significant-sized projects (for brevity, I’ll call these “professionals”)


The “beginner” approach is often:

  • to be very gung-ho about fixing everything on their first day
  • to have very ambitious ideas about overhauling legacy systems
  • to parrot many of the “rules” they learned in school

Such “rules” include:

  • don’t use single-letter variable names
  • comment your code
  • use spaces over tabs / tabs over spaces / 2-space indents / 4-space indents (you can see there is already a problem because these are all inconsistent with each other!)
  • don’t do premature optimization (unfortunately, as beginners, they have no idea what constitutes “premature optimization” in the first place, because they haven’t spent enough time learning the system first)

The last point reminds me of people who comment in online forums, who often parrot the phrase “correlation does not imply causation”.

As statisticians, we think “yes yes, we all know that, but you’re missing the point of the discussion”. Statisticians already have this idea solidly implanted in their minds. There’s no need to repeat it at every opportunity.


The professional, more mature approach differs in the following ways:

They are not gung-ho about fixing everything immediately. In large systems, when you change one thing, it can potentially affect many other things you haven’t even thought of.

This is an example of how beginners “don’t know what they don’t know”. It’s like pulling out a Jenga piece and the whole tower falling down as a result.

Professionals also understand that things might seem weird at first but that there is probably a good reason that they are that way.

Real systems are complex and sometimes compromises have to be made. Professionals get this.


Professionals also temper their ambitions to overhaul legacy systems. They are better at estimating how much time things take, compared to beginners. Beginners often feel they can do anything.

Professionals, thanks to experience, actually know what that “anything” turns into once committed.


Finally, professionals generally do not parrot rules like “don’t use single-letter variable names”. Instead, they understand context and convention.

Obviously, calling your API key “x” doesn’t make sense, but calling your model input “x” when that’s also what it’s referred to in the math does make sense.

Obviously, commenting your code makes sense, but if you are taking a course in which the “commentary” for the code consists of the actual video lecture itself, you shouldn’t expect the same comments to be repeated in the code.

Sometimes, what is appropriate in one context is not appropriate in another. Beginners don’t understand this, and try to apply the same rules in all contexts. They cannot adapt.

Professionals understand that it is more important to conform to the conventions and processes of their team, and to be predictable. This is efficient.

Things are easier to understand when everyone does things the same way everywhere.

As a simple example, if your whole team uses 2-space indents, but you start using tabs, you’re going to mess up the repo for everyone else.

If your whole team uses conventional math symbols (e.g. x for inputs, z for latent variables) and you start writing variable names like latent_cluster_identity_probability, it’s going to look very weird (especially when everyone already knows what “z” means).


Artificial Intelligence Boxing Day Blowout!

December 26, 2018

Deep Learning and AI Courses for just $11.99

Boxing Day 2018

Celebrate the Holidays with New AI & Deep Learning Courses!

I’ve been busy making free content and updates for my existing courses, so guess what that means? Everything on sale!

For the next week, all my Deep Learning and AI courses are available for just $11.99!

For my courses, please use the coupons below (included in the links), or if you want, enter the coupon code: DEC2018.

For prerequisite courses (math, stats, Python programming) and all other courses, follow the links at the bottom for sales of up to 90% off!

Since ALL courses on Udemy are on sale, if you want any course not listed here, just click the general (site-wide) link, and search for courses from that page.

And just as important, $11.99 coupons for some helpful prerequisite courses. You NEED to know this stuff to understand machine learning in-depth:

General (site-wide): http://bit.ly/2oCY14Z
Python http://bit.ly/2pbXxXz
Calc 1 http://bit.ly/2okPUib
Calc 2 http://bit.ly/2oXnhpX
Calc 3 http://bit.ly/2pVU0gQ
Linalg 1 http://bit.ly/2oBBir1
Linalg 2 http://bit.ly/2q5SGEE
Probability (option 1) http://bit.ly/2p8kcC0
Probability (option 2) http://bit.ly/2oXa2pb
Probability (option 3) http://bit.ly/2oXbZSK



As you know, I’m the “Lazy Programmer”, not just the “Lazy Data Scientist” – I love all kinds of programming!


iOS courses:

Android courses:

Ruby on Rails courses:

Python courses:

Big Data (Spark + Hadoop) courses:

Javascript, ReactJS, AngularJS courses:



Into Yoga in your spare time? Photography? Painting? There are courses, and I’ve got coupons! If you find a course on Udemy that you’d like a coupon for, just let me know and I’ll hook you up!


Neural Ordinary Differential Equations

December 15, 2018

Very interesting paper that got the Best Paper award at NIPS 2018.

“Neural Ordinary Differential Equations” by Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud.

Comes out of Geoffrey Hinton’s Vector Institute in Toronto, Canada (although he is not an author on the paper).

For those of you who have ever programmed simulations of systems of differential equations, the motivation behind this should be quite intuitive.

Recall that a derivative is the same thing as the slope of a tangent line, and can be approximated by the usual “rise over run” formula for small time steps \( \Delta t \).

$$ \frac{dh}{dt} \approx \frac{h(t + \Delta t) - h(t)}{\Delta t}$$

Here’s a picture of that if you forgot what it looks like:


Normally, the derivative is known to be some function \( \frac{dh}{dt} = f(h, t) \).

Your job in writing a simulation is to find out how \( h(t) \) evolves over time.

Here’s a picture of how that works (using different symbols):


Since our job is to find the next value of \( h(t) \), we can rearrange the above to get:

$$ h(t + \Delta t) = h(t) + f(h(t), t) \Delta t $$

If we take the time step to be \( \Delta t = 1 \), as is typical for a discrete-time system, we can rewrite the above as:

$$ h_{t+1} = h_t + f(h_t, t) $$
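
As a rough sketch of what such a simulation looks like in code, the update rule \( h(t + \Delta t) = h(t) + f(h(t), t) \Delta t \) can be applied in a loop. The function name `euler_simulate` and the test problem \( \frac{dh}{dt} = -h \) are my own choices for illustration; they are not from the paper.

```python
import math

def euler_simulate(f, h0, t0, t1, dt):
    """Integrate dh/dt = f(h, t) from t0 to t1 using fixed steps of size dt."""
    n_steps = round((t1 - t0) / dt)
    h, t = h0, t0
    for _ in range(n_steps):
        h = h + f(h, t) * dt   # h(t + dt) = h(t) + f(h(t), t) * dt
        t = t + dt
    return h

# Hypothetical test problem: dh/dt = -h with h(0) = 1.
# Its exact solution is h(t) = exp(-t).
f = lambda h, t: -h

for dt in (0.5, 0.1, 0.01):
    approx = euler_simulate(f, h0=1.0, t0=0.0, t1=1.0, dt=dt)
    print(f"dt={dt}: h(1) ~ {approx:.4f}  (exact {math.exp(-1.0):.4f})")
```

The smaller the step, the closer the result gets to the exact solution. With a step of 1, the loop body is exactly the update \( h_{t+1} = h_t + f(h_t, t) \).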

Researchers noticed that this looks a lot like the residual network layer that is often used in deep learning!

In a residual network layer, \( h_t \) represents the input value, \( h_{t+1} \) represents the output value, and \( f(h_t, t) \) represents the residual.
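
The correspondence can be checked in a few lines of NumPy. This is a toy sketch under my own assumptions: the residual branch is a single tanh layer with made-up sizes and random weights, not an architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer size and weights, for illustration only.
d = 4
W = rng.normal(scale=0.1, size=(d, d))
b = np.zeros(d)

def f(h, t):
    """The residual branch. t is unused in this toy version; in a real
    network, each layer t would have its own weights."""
    return np.tanh(W @ h + b)

def residual_layer(h):
    """Residual block: output = input + residual."""
    return h + f(h, t=None)

def euler_step(h, t, dt=1.0):
    """One Euler step of dh/dt = f(h, t)."""
    return h + f(h, t) * dt

h_t = rng.normal(size=d)
h_next = residual_layer(h_t)

# With dt = 1 the two updates coincide.
assert np.allclose(h_next, euler_step(h_t, t=0, dt=1.0))
```

Stacking several such layers is then the same as running the Euler loop for several steps.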

Here’s a picture of that (using different symbols):



At this point, the question to ask is, if a residual network layer is just a difference equation that approximates a differential equation, can there be a neural network layer that is an actual differential equation?

How would backpropagation be done?

This paper goes over all that and more.

Read the paper here! https://arxiv.org/abs/1806.07366

