I tend to wonder - beyond "ease of implementation and good 'nuff" reasons - whether there are other reasons to use ReLU over other activation functions like tanh or sigmoid?
I'm beginning to suspect that we may be seeing the "engineering side" of neural networks coming into play: instead of using the more "biologically accurate" sigmoid activation, we use ReLU (and variants such as ELU) because it works well and is easier to understand?
Much like how heavier-than-air flight progressed faster once engineers realized that flapping wings weren't strictly needed, and that lightweight engines turning propellers on fixed wings worked better for flying than what nature uses...?
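For anyone who wants to see the "ease of implementation" point concretely, here is a minimal sketch of the three activations and two of their derivatives in plain Python. The function names are my own, not from any particular library, and this is only an illustration under that assumption, not how a real framework implements them.

```python
import math

def relu(x: float) -> float:
    # ReLU: pass positive inputs through, clamp negatives to zero.
    return max(0.0, x)

def relu_grad(x: float) -> float:
    # Derivative is 1 for positive inputs and 0 for negative ones
    # (the value at exactly 0 is a convention; 0 is used here).
    return 1.0 if x > 0 else 0.0

def sigmoid(x: float) -> float:
    # Logistic sigmoid: squashes any input into (0, 1).
    # Note: math.exp can overflow for very large negative x in this naive form.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    # Derivative of the sigmoid, expressed via its own output.
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x: float) -> float:
    # Hyperbolic tangent: squashes any input into (-1, 1).
    return math.tanh(x)

if __name__ == "__main__":
    for x in (-2.0, 0.0, 3.0):
        print(x, relu(x), sigmoid(x), tanh(x))
```

ReLU needs only a comparison, while sigmoid and tanh each need an exponential, which is at least part of the "easy and good 'nuff" argument.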
That said, a lot of the history of neural networks has been brief moments of biological inspiration followed by hacking and engineering that drifts further away from the biology the better it gets. The biggest example is backpropagation; despite how essential it's been to artificial neural networks, it really doesn't exist in the brain, at least not as simply as it does in code. For now, we're all still exploring, some looking towards biology, some towards abstract principles, and it remains to be seen if one provides consistently better results.
IIRC, Monte Carlo Tree Search with "dumber" heuristics than NNs yielded amateur dan-level AIs for the first time (somewhere around 2006?). Lately there have also been some AIs that bolt an NN on and get around 1 stone stronger (which is still miles away from AlphaGo!).
But since this is specifically modelled after AlphaGo, I wonder how it fares against other AIs.
