
Yes, some non-linearity is important - not for Turing completeness, but because without it consecutive layers collapse into a single linear transformation of the same size, so the extra depth is just useless computation.
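A quick way to see the collapse (a minimal NumPy sketch; the layer sizes are arbitrary): two stacked linear layers with weights W1 and W2 compute exactly the same function as one layer whose weight is the product W2 @ W1.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((64, 32))   # first "layer"
    W2 = rng.standard_normal((10, 64))   # second "layer"
    x = rng.standard_normal(32)

    # Two linear layers applied in sequence...
    two_layers = W2 @ (W1 @ x)
    # ...are exactly one linear layer with the product matrix.
    one_layer = (W2 @ W1) @ x

    assert np.allclose(two_layers, one_layer)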

However, the "decision point" of the ReLU (and it's everywhere-differentiable friends like leaky ReLU or ELU) provides a sufficient non-linearity - in essence, just as a sigmoid effectively results in a yes/no chooser with some stuff in the middle for training purposes, so does the ReLU "elbow point".

Sigmoid has a problem of 'vanishing gradients' in deep networks: the sigmoid's derivative is between 0 and 0.25, so in standard backpropagation a 'far away' layer gets its gradient multiplied by one of those small factors per layer, leaving tiny, useless gradients if there are a hundred sigmoid layers in between.
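A rough illustration of that shrinkage (a sketch; it only multiplies the per-layer activation derivatives and ignores the weight matrices, which in practice can make things better or worse):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # derivative of sigmoid is s * (1 - s), which peaks at 0.25 (at x = 0)
    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    # chain rule through 100 sigmoid layers: multiply the per-layer derivatives
    grad = 1.0
    for _ in range(100):
        grad *= sigmoid_grad(0.0)   # best case: a factor of 0.25 each time

    print(grad)   # 0.25**100 is about 6e-61 -- effectively zero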



