It's also worth checking out existing neural net code-bases to see what tricks they have. The fine details usually aren't in papers, and they're not all in the text-books either.
The first potential problem that jumped out at me in this code was the initialization:
    self.weights = [np.array()] + [np.random.randn(y, x)
                                   for y, x in zip(sizes[1:], sizes[:-1])]
I'd multiply those initial weights by a small constant divided by the square root of the number of weights going into the same neuron. Otherwise the summed input to a neuron grows with its fan-in and the sigmoids start out saturated. For multiple layers you might consider layer-by-layer pre-training. For other architectures, like recurrent nets, definitely find a reference on how to do the initialization.
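A minimal sketch of that scaling, assuming the same `sizes` list as the quoted code. The function name `init_weights`, the `scale` constant, and the seed are my own choices for illustration, not anything from the repo:

```python
import numpy as np

def init_weights(sizes, scale=1.0, seed=0):
    """Gaussian weights scaled by scale / sqrt(fan_in) for each layer.

    `scale` is the "small constant"; 1.0 is only a placeholder value.
    """
    rng = np.random.default_rng(seed)
    return [scale * rng.standard_normal((y, x)) / np.sqrt(x)  # x = inputs per neuron
            for y, x in zip(sizes[1:], sizes[:-1])]

weights = init_weights([784, 30, 10])
shapes = [w.shape for w in weights]
```

With this scaling, the standard deviation of each weight matrix is roughly `scale / sqrt(fan_in)`, so the pre-activation of a neuron stays of order one regardless of how many inputs it has.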
PS I would definitely add a test routine to check that the gradients from back-propagation agree with a finite difference approximation. It's so easy to get gradient code wrong, and it's so easy to test.
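Something along these lines would do as a gradient check. It's a sketch on a toy single-layer sigmoid network with a squared-error loss, not the repo's actual network; `numerical_gradient`, `loss`, and `loss_grad` are hypothetical names I made up for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, t):
    """Squared error of a single sigmoid layer (toy stand-in for a real net)."""
    y = sigmoid(w @ x)
    return 0.5 * np.sum((y - t) ** 2)

def loss_grad(w, x, t):
    """Analytic gradient of `loss` with respect to w (what backprop would give)."""
    y = sigmoid(w @ x)
    delta = (y - t) * y * (1.0 - y)
    return np.outer(delta, x)

def numerical_gradient(f, w, eps=1e-5):
    """Central finite-difference approximation of df/dw, one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f(w)
        w[idx] = old - eps
        f_minus = f(w)
        w[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 4)) / np.sqrt(4)
x = rng.standard_normal(4)
t = np.array([0.0, 1.0, 0.0])

analytic = loss_grad(w, x, t)
numeric = numerical_gradient(lambda w_: loss(w_, x, t), w)
max_error = np.max(np.abs(analytic - numeric))
```

If `max_error` isn't tiny compared with the size of the gradient entries, the analytic gradient code is almost certainly wrong. The loop is slow, so run it on a small network only.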
Given that you are highly qualified to answer, I am genuinely curious why you think that is. Reimplementing algorithms from scratch is an efficient way to learn, understand the underlying concepts and attempt improvements in a research context.
That said, there are also whole papers, even collected volumes, on initialization and other practical details.
Textbooks aren't always up-to-date with the latest practical knowledge, as deep-learning practice is moving quickly. Or they simply don't want to clutter their high-level maths descriptions with code-level implementation details. Teaching is all about tradeoffs. I'm sure several books do mention the scale of weights for simple feed-forward networks though, as it's not an implementation-level detail, and it's probably been well known since the 1980s.
As for textbooks, I imagine that the field is moving too fast; half the stuff I use has only existed for the past year or two.
I do not want in any way to sound critical and am genuinely curious about the dynamics of why people would find this interesting given its reduced complexity.
Disclaimer: I'm such a developer! (currently going through the last bits of https://www.coursera.org/learn/machine-learning) - but I've noticed others around me recently.
The code is actually based on the original code from the book (as can be seen from variable names like 'nabla'), but written in a more succinct manner.
Since I am relatively new to Python, I found it easier to follow this repo's code than the code in the book and used it as my reference implementation.
It's missing quite a few things like calculating accuracy, regularization, etc. but they are quite straightforward to implement.
 Neural Networks and Deep Learning by Michael Nielsen - http://neuralnetworksanddeeplearning.com
At least, that's why I clicked the link.
I think the Udacity course is best if you know the principles of machine learning and want to apply them in a more professional toolchain and learn TensorFlow.
Given that it's intended to introduce beginners to how neural nets work, the choice of activation is an aside anyway - the real meat is back/forwardprop.
The softmax_regression and logistic regression examples are even easier.
There are bindings for Node.js, Python, and other languages.
But it's so nice to be able to follow the definitions of each symbol and function in Visual Studio, not to mention being able to step through the imperative code.
And it's fast.
*I work at ArrayFire.
I've started trying to get a network to recognise different vowels ("aaahhhhhh", "eeeeee", "ooooooo", etc.). Relatively easy to generate data - you just need your voice and a microphone. Downside is all the NN systems are much more set up for images than sound.
Or what about neural net fingerprint recognition? There must be databases of fingerprints somewhere. Or irises.
Recognise a type of wood from images of its grain?
Or activity recognition from accelerometer data. I think Pebble recently open sourced their recogniser and it was surprisingly not a neural network. I'm sure a neural network could do better. Might be hard to get a decent amount of data here but this could be a good incentive to do exercise!