I'm testing it on a 3-layer perceptron, so memory is less of an issue, but `__slots__` seems to speed up training by 5%! I pushed the implementation to a branch: https://github.com/noway/yagrad/blob/slots/train.py
Unfortunately it extends the line count past 100 lines, so I'll keep it separate from `main`.
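For anyone following along, here's a rough sketch of the general idea. This is not the code from the branch: the `Value` class and its attribute names are made up for illustration, assuming a micrograd-style scalar node.

```python
class Value:
    # Hypothetical autograd node, for illustration only.
    # Declaring __slots__ stops Python from creating a per-instance __dict__,
    # which saves memory and can make attribute access a little faster.
    __slots__ = ("data", "grad", "_prev")

    def __init__(self, data, prev=()):
        self.data = data
        self.grad = 0.0
        self._prev = prev  # parent nodes that produced this value

    def __add__(self, other):
        # The result remembers its operands so a backward pass could walk the graph.
        return Value(self.data + other.data, (self, other))


v = Value(1.0) + Value(2.0)
print(v.data)                  # 3.0
print(hasattr(v, "__dict__"))  # False: slotted instances carry no __dict__
```

The trade-off is that you can no longer attach arbitrary attributes to an instance; assigning to a name not listed in `__slots__` raises `AttributeError`.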
I have my email address on my website (which is in my bio) - don't hesitate to reach out. Cheers!