Deep Learning: The Swish Activation Function

The Google Brain team has just released a new paper (https://arxiv.org/abs/1710.05941) that demonstrates the superiority of a new activation function called Swish on a number of different neural network architectures.

This is interesting because people often ask me, “which activation function should I use?”

These days, it is common to just use the ReLU by default.

To refresh your memory, the ReLU looks like this:

[Figure: plot of the ReLU activation function]

And it is defined by the equation:

$$ f(x) = \max(0, x) $$
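
In code, the ReLU is a one-liner. Here is a minimal numpy version:

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x), applied elementwise
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 3.0])))  # [0. 0. 0. 3.]
```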

One major problem with the ReLU is that its derivative is 0 whenever the input \( x \) is negative, i.e. for half of its domain. Because we use “gradient descent” as our parameter update algorithm, if the gradient is 0 for a parameter, then that parameter will not be updated!

In other words, when I do:

$$ \theta = \theta - \alpha \frac{\partial J}{\partial \theta} $$

And:

$$ \frac{\partial J}{\partial \theta } = 0 $$

Then my update is just:

$$ \theta = \theta $$

Which just assigns the parameter back to itself.

This leads to the problem of “dead neurons”. Experiments have shown that neural networks trained with ReLUs can have up to 40% dead neurons!
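
To make the no-update behavior concrete, here is a toy numpy sketch of a single weight feeding a “dead” unit (the specific numbers, including the upstream gradient of 1.0, are made up purely for illustration):

```python
import numpy as np

def relu_grad(z):
    # Derivative of max(0, z): 1 where z > 0, and 0 everywhere else
    return np.where(z > 0, 1.0, 0.0)

theta = 0.7      # a weight feeding a ReLU unit
z = -2.0         # the unit's pre-activation happens to be negative
upstream = 1.0   # dJ / d(relu output), assumed to be 1.0 for illustration
dz_dtheta = 1.3  # stand-in for the input that multiplies theta
alpha = 0.1      # learning rate

dJ_dtheta = upstream * relu_grad(z) * dz_dtheta  # chain rule through the ReLU
theta = theta - alpha * dJ_dtheta
print(theta)     # still 0.7: the zero local derivative blocks any update
```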

There have been some proposed alternatives to this, such as the leaky ReLU, the ELU, and the SELU.

Interestingly, none of these have seemed to catch on and it’s still ReLU by default.

So how does the Swish activation function work?

The function itself is very simple:

$$ f(x) = x \sigma(x) $$

Where \( \sigma(x) \) is the usual sigmoid activation function.

$$ \sigma(x) = (1 + e^{-x})^{-1} $$
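
If you want to play with it numerically, here is a minimal numpy sketch, a direct translation of the two formulas above:

```python
import numpy as np

def sigmoid(x):
    # Sigmoid: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # Swish: x * sigmoid(x)
    return x * sigmoid(x)

x = np.linspace(-5, 5, 11)
print(swish(x))  # small negative values for x < 0, approximately x for large positive x
```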

It looks like this:

[Figure: plot of the Swish activation function]

What’s interesting about this is that, unlike the activation functions we normally use (ReLU, sigmoid, tanh), it is not monotonically increasing: it dips slightly below zero for negative inputs before rising again. Does it matter? It seems the answer is no!

The derivative looks like this:

[Figure: plot of the derivative of the Swish activation function]
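
For reference, applying the product rule to \( f(x) = x \sigma(x) \) together with \( \sigma'(x) = \sigma(x)(1 - \sigma(x)) \) gives:

$$ f'(x) = \sigma(x) + x \sigma(x)(1 - \sigma(x)) = f(x) + \sigma(x)(1 - f(x)) $$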

One interesting thing we can do is re-parameterize the Swish, in order to “stretch out” the sigmoid:

$$ f(x) = 2x \sigma(\beta x) $$

We can see that, if \( \beta = 0 \), then we get the identity activation \( f(x) = x \) (since \( \sigma(0) = 0.5 \)), and if \( \beta \rightarrow \infty \) then the sigmoid converges to the unit step; multiplying that by \( 2x \) gives us \( f(x) = 2\max(0, x) \), which is just the ReLU scaled by a constant factor.

So including \( \beta \) gives us a way to interpolate nonlinearly between the identity and a (scaled) ReLU.
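
To convince yourself of these two limits, here is a quick numpy check (\( \beta = 50 \) below is just an arbitrary stand-in for “large”):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish_beta(x, beta):
    # The re-parameterized form above: f(x) = 2x * sigmoid(beta * x)
    return 2.0 * x * sigmoid(beta * x)

x = np.linspace(-3, 3, 7)
print(swish_beta(x, 0.0))        # equals x exactly, since sigmoid(0) = 0.5
print(swish_beta(x, 50.0))       # approximately 2 * max(0, x)
print(2.0 * np.maximum(0.0, x))  # the scaled ReLU, for comparison
```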

The title of the paper is “Swish: a Self-Gated Activation Function”, which might make you wonder, “Why is it self-gated?”

This should remind you of the LSTM, where we have “gates” in the form of sigmoids that control how much of a vector gets passed on to the next stage, by multiplying it by the output of the sigmoid, which is a number between 0 and 1.

So “self-gated” means that the gate is just the sigmoid of the activation itself.

Gate: \( \sigma(x) \)

Value to pass through: \( x \)
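
In code, the self-gating reads exactly like that decomposition. Here is a tiny numpy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.5, 2.0])  # pre-activations
gate = sigmoid(x)    # the gate: a number between 0 and 1, computed from x itself
value = x            # the value being gated is also x
print(gate * value)  # identical to Swish: x * sigmoid(x)
```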

But that’s enough theory. What most of us really want to know is: “Does it work?”

And more practically, “Can I just use this by default instead of the ReLU?”

The best thing to do is to try it for yourself and see how robust it is to different hyperparameter settings (learning rate, architecture, etc.). But first, let’s look at some results from the paper so we can be confident when it comes to using Swish:

[Figure: results from the paper comparing Swish against baseline activation functions]


To compare Swish with the baselines, the authors used a statistical test called the one-sided paired sign test.
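
As a rough sketch of what that test does (the win count below is made up, not taken from the paper): it counts how many (model, dataset) comparisons Swish wins and asks whether that count is surprisingly large under a fair coin flip. With scipy this is a one-sided binomial test:

```python
from scipy.stats import binomtest

# Hypothetical example: suppose Swish beats the baseline on 8 of 9
# non-tied comparisons. Under the null hypothesis that wins and losses
# are equally likely, the win count is Binomial(n=9, p=0.5).
result = binomtest(k=8, n=9, p=0.5, alternative="greater")
print(result.pvalue)  # small p-value -> evidence that Swish wins more often than chance
```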

Conclusion: Try Swish for yourself!
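
If you want to drop it into one of your own models, here is a minimal sketch using Keras (assuming a recent TensorFlow install; the layer sizes and the MNIST-style input shape are arbitrary placeholders):

```python
import tensorflow as tf

def swish(x):
    # Swish activation: x * sigmoid(x)
    return x * tf.sigmoid(x)

# A small fully connected network with Swish wherever you would normally put ReLU
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation=swish, input_shape=(784,)),
    tf.keras.layers.Dense(128, activation=swish),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Swapping back to the ReLU is a one-word change, so it is easy to compare the two on your own data.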