
From the neural network Wikipedia page:

> The signal each neuron outputs is calculated from this number, according to its activation function

The activation function is usually a sigmoid, which is in turn usually defined in terms of exponential or hyperbolic trigonometric functions.

Which neural networks don’t use trigonometric functions or equivalent?




In current neural networks the activation function usually isn't a sigmoid but something like ReLU (y = 0 if x < 0, else x). In any case, computing the activations is not a meaningful part of the total compute: for non-tiny networks, almost all of the effort goes into the large matrix multiplications of the layers before the activation function. As the network grows, the activations become even less relevant, since the amount of activation work grows linearly with layer size while the core matrix computation grows superlinearly (n^2.8 perhaps?).
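To put rough numbers on it, here's a minimal numpy sketch of one dense layer (the layer width is made up) comparing the work in the matmul against the work in the ReLU:

    import numpy as np

    n = 4096                      # hypothetical layer width
    W = np.random.randn(n, n)     # weights of one dense layer
    x = np.random.randn(n)

    h = W @ x                     # ~n*n multiply-adds: this dominates
    y = np.maximum(h, 0.0)        # ReLU: only n cheap comparisons

    print(f"matmul ops ~ {n*n:,}, activation ops ~ {n:,}")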


Really?

It's literally been decades, but the last time I studied neural networks, a non-linear activation function was important for Turing completeness.


Yes, some non-linearity is important - not for Turing completeness, but because without it consecutive layers collapse into a single linear transformation of the same size, and you're just doing useless computation.
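A quick numpy demonstration of that collapse (toy sizes, random weights):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((8, 8))
    W2 = rng.standard_normal((8, 8))
    x = rng.standard_normal(8)

    two_layers = W2 @ (W1 @ x)     # two linear layers, no activation between
    one_layer = (W2 @ W1) @ x      # a single pre-multiplied matrix

    print(np.allclose(two_layers, one_layer))   # True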

However, the "decision point" of the ReLU (and it's everywhere-differentiable friends like leaky ReLU or ELU) provides a sufficient non-linearity - in essence, just as a sigmoid effectively results in a yes/no chooser with some stuff in the middle for training purposes, so does the ReLU "elbow point".

Sigmoids also have the 'vanishing gradients' problem in deep networks: the sigmoid's derivative is at most 0.25, so with standard backpropagation a 'far away' layer will receive tiny, useless gradients if there are a hundred sigmoid layers in between.
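You can see the scale of the problem with simple arithmetic - even in the best case, where every layer contributes the maximum derivative of 0.25, the gradient shrinks geometrically with depth:

    # best-case gradient attenuation through k sigmoid layers,
    # assuming each layer contributes its maximum derivative of 0.25
    for k in (1, 10, 100):
        print(k, 0.25 ** k)
    # 1    0.25
    # 10   ~9.5e-07
    # 100  ~6.2e-61   -> effectively zero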


ReLUs are indeed non-linear, despite the confusing name.

The nonlinearity (plus at least two layers) is required to solve nonlinear problems (like the famous XOR example).
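For example, here's a hand-wired two-layer ReLU net that computes XOR (the weights are just one choice that happens to work):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    # hand-picked weights: hidden units compute relu(x1+x2) and relu(x1+x2-1)
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([0.0, -1.0])
    w2 = np.array([1.0, -2.0])

    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        h = relu(W1 @ np.array(x, dtype=float) + b1)
        print(x, w2 @ h)          # 0, 1, 1, 0

No single linear (affine) layer can produce that output pattern.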


Neural networks aren't Turing complete (they're circuits, not state machines), and ReLU is not just nonlinear but in fact not even differentiable at zero.


Most activation functions are based on exponential functions.

I don't immediately see how CORDIC would help you calculate e^x unless x is complex-valued (which it usually is not).


See section B of https://eprints.soton.ac.uk/267873/1/tcas1_cordic_review.pdf. Generalized CORDIC can compute the hyperbolic trig functions, which gives you exponentials (e^x = cosh(x) + sinh(x)). That said, it's still not useful for ML because it's pretty hard to beat polynomials and tables.
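For the curious, here's a rough Python sketch of rotation-mode hyperbolic CORDIC computing e^t as cosh(t) + sinh(t). It only converges for |t| up to about 1.1, and the iteration indices 4, 13, 40, ... have to be repeated:

    import math

    def exp_cordic(t, n=24):
        # build the iteration schedule; indices 4, 13, 40, ... are repeated
        sched, i, rep = [], 1, 4
        while len(sched) < n:
            sched.append(i)
            if i == rep:
                sched.append(i)
                rep = 3 * rep + 1
            i += 1
        sched = sched[:n]
        # hyperbolic CORDIC gain for this schedule
        K = 1.0
        for i in sched:
            K *= math.sqrt(1.0 - 2.0 ** (-2 * i))
        x, y, z = 1.0 / K, 0.0, t       # pre-scale so the gain cancels
        for i in sched:
            d = 1.0 if z >= 0 else -1.0
            x, y, z = (x + d * y * 2.0 ** -i,
                       y + d * x * 2.0 ** -i,
                       z - d * math.atanh(2.0 ** -i))
        return x + y                    # cosh(t) + sinh(t) = e^t

    print(exp_cordic(0.5), math.exp(0.5))   # both ~1.64872

In hardware the multiplications by 2^-i become shifts and the atanh values come from a small lookup table, which is the whole point of CORDIC.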


Interesting, thanks.


You just need a vague approximation. Even CORDIC is overkill.

clamp(x/2, -1, 1) is basically a sigmoid (strictly a hard tanh, but tanh is just a shifted and scaled sigmoid).
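For instance (numpy, comparing against tanh since both are [-1, 1]-ranged):

    import numpy as np

    def hard_sigmoid(x):              # the clamp above; really a "hard tanh"
        return np.clip(x / 2.0, -1.0, 1.0)

    xs = np.linspace(-4, 4, 9)
    print(np.round(hard_sigmoid(xs), 2))
    print(np.round(np.tanh(xs), 2))   # same rough S-shape; differs by < 0.27 everywhere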



