I'm not an expert on this subject; does anybody have any insights on this?
My conclusion was that I could set >99% of the weights in my (fully connected) layers to zero with minimal performance impact after enough training. But the training time went up a lot (effectively, after removing a batch of connections, you have to do more training before you can remove more), and inference speed wasn't really improved because sparse matrix ops are sloooow.
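For anyone who wants to try this kind of experiment, here's a minimal sketch using PyTorch's built-in `torch.nn.utils.prune`. The layer sizes and pruning fraction are just illustrative, not what I actually used; in practice you'd interleave pruning with retraining (iterative magnitude pruning), which is the "train, prune, train more" loop I described:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small fully connected network (illustrative sizes).
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Magnitude-based unstructured pruning: zero out 99% of each
# weight matrix, smallest absolute values first.
linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
for m in linears:
    prune.l1_unstructured(m, name="weight", amount=0.99)

# The zeros live behind a mask; the underlying tensor is still dense,
# which is why pruning alone gives no inference speedup.
for i, m in enumerate(linears):
    zeros = (m.weight == 0).float().mean().item()
    print(f"layer {i}: {zeros:.1%} of weights are zero")
```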
Overall, while it works out for biology, I don't think it will work for silicon.
The only reason we architect ANNs the way we do is to optimize computation. The layered bipartite graph structure is optimized for GPU matrix math. Systems like NEAT have not been used at scale because they are a lot more expensive to train, and the trained network is more expensive to run. ASICs and FPGAs have a chance of running a NEAT-generated network efficiently in production, but we still don't have a computer well suited to training a NEAT network.
NEAT would totally be competitive if someone actually got a version running efficiently in PyTorch/TensorFlow.
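To make the difficulty concrete, here's a rough sketch (in PyTorch, with a made-up genome format; nothing here is a real NEAT implementation) of why an evolved graph is awkward on GPUs: each node depends on arbitrary predecessors, so evaluation becomes a sequential node-by-node loop instead of one big batched matmul:

```python
import torch

# Hypothetical NEAT-style genome: nodes in topological order, each with
# a list of (source_node, weight) connections. Real NEAT genomes carry
# more metadata; this is just for illustration.
genome = {
    2: [(0, 0.5), (1, -1.2)],  # hidden node fed by the two inputs
    3: [(0, 0.3), (2, 0.8)],   # output node fed by input 0 and node 2
}
n_inputs = 2

def eval_genome(genome, x):
    # Node-by-node evaluation: each node waits on its predecessors,
    # so the work is inherently sequential and scalar. This is the
    # part GPUs hate.
    values = {i: x[i] for i in range(n_inputs)}
    for node, conns in genome.items():
        total = sum(w * values[src] for src, w in conns)
        values[node] = torch.tanh(total)
    return values[max(genome)]

x = torch.tensor([1.0, -0.5])
print(eval_genome(genome, x))

# Compare: a dense layer does all of its math in a single fused
# matrix-multiply kernel, which is what GPUs are built for.
W = torch.randn(256, 256)
h = torch.tanh(W @ torch.randn(256))
```

The usual workaround is to pad the evolved graph into dense masked matrices, but that throws away the sparsity you were hoping to exploit, which is the same tradeoff described upthread.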
It is based on NEAT (as other commenters mentioned) and also ties in some discussion of the Lottery Ticket Hypothesis, which you mentioned.