
Benchmarking deep learning activation functions on MNIST [OC] - rickdeveloper
https://heartbeat.fritz.ai/benchmarking-deep-learning-activation-functions-on-mnist-3d174e729735
======
Hawkenfall
A more in-depth paper on this topic found that the Swish activation often
outperformed other activation functions:
[https://arxiv.org/abs/1710.05941](https://arxiv.org/abs/1710.05941)
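
For reference, the function itself is tiny. A minimal NumPy sketch of Swish as
defined in that paper, x * sigmoid(beta * x) (with beta = 1 by default):

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish as defined in the paper: x * sigmoid(beta * x).

    With beta = 1 this is the same as SiLU; the paper also allows
    beta to be a learnable parameter.
    """
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

# Swish tracks ReLU for large positive inputs but is smooth and
# slightly non-monotonic around zero.
x = np.linspace(-5.0, 5.0, 11)
print(swish(x))
```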

~~~
osipov
Much of the recent research has moved to the GELU (Gaussian Error Linear Unit)
activation function:
[https://arxiv.org/pdf/1606.08415.pdf](https://arxiv.org/pdf/1606.08415.pdf)
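
GELU is x * Phi(x), where Phi is the standard normal CDF. A small NumPy/SciPy
sketch of the exact form and the tanh approximation given in the paper:

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation from the paper; common in practice."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-3.0, 3.0, 7)
# The approximation error is tiny over typical input ranges.
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))
```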

------
albertzeyer
These are small differences, and this is on MNIST (a very small toy dataset).
This is likely just noise. How big is the variance when each experiment is
repeated with different random seeds? And more interestingly, how do the
activations compare on harder problems? Try them on real-world tasks such as
speech recognition (e.g. LibriSpeech). I don't think you can draw any
conclusions from the current results.
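
To make the multi-seed point concrete, here is a rough sketch of the
bookkeeping. `run_experiment` is a hypothetical stand-in that only returns a
synthetic number; in a real benchmark it would train and evaluate the model
for the given activation and seed:

```python
import numpy as np

def run_experiment(activation, seed):
    """Hypothetical stand-in for one MNIST run.

    A real benchmark would build, train, and evaluate the model with the
    given activation and random seed, then return test accuracy. This
    stand-in ignores `activation` and just returns a synthetic value so
    the sketch runs on its own.
    """
    rng = np.random.default_rng(seed)
    return 0.98 + rng.normal(scale=0.002)

# Report mean and standard deviation over several seeds per activation;
# differences smaller than the seed-to-seed spread are probably noise.
for act in ["relu", "swish", "gelu"]:
    accs = [run_experiment(act, seed) for seed in range(5)]
    print(f"{act}: mean={np.mean(accs):.4f} std={np.std(accs):.4f}")
```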

~~~
rickdeveloper
Thanks for the advice!

