We have a bunch of bounties on it, and we're getting 94%+ now! Mostly not me who wrote this; see the commit history. Still have to switch to float16 and add Winograd convs. We have a branch with multi-GPU too.
Well, you do seem to be extremely even-handed and fair on that front, so my respect to you for that. Don't be too hard on yourself, please!
I'll keep an eye on that; anyone working on it can shoot me a message if any snags or questions about the particulars come up. Weirdly enough, my email is best.
If you're looking for the biggest glaring performance edge over PyTorch, I'd note that MaxPooling is probably where to go: the PT version is extremely slow for some reason, and done properly it should be a simple indexing operation that's fusible in either direction.
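Not the author's code, but a minimal sketch of what "a simple indexing operation" could mean here, assuming NCHW layout and the non-overlapping 2x2/stride-2 case; the max_pool_2x2 name is hypothetical:

    import numpy as np

    def max_pool_2x2(x: np.ndarray) -> np.ndarray:
        # Split H and W into (blocks, 2) via a pure reshape, then reduce
        # over the two window axes -- no sliding-window loop needed.
        n, c, h, w = x.shape
        return x.reshape(n, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))

    x = np.random.randn(512, 3, 32, 32)
    assert max_pool_2x2(x).shape == (512, 3, 16, 16)

Because it's just a reshape plus a max reduction, the op composes cleanly with whatever comes before or after it, which is what makes it fusible in either direction.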
If whoever fulfills the bounty can beat me to writing a custom mega-fused kernel with the max pooling, convs, activation, etc., then y'all have a pretty good shot at taking the WR crown.
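For reference, here's roughly the op chain such a fused kernel would cover, written against tinygrad's public Tensor API (conv2d, relu, max_pool2d); the shapes are my own CIFAR-ish placeholders, not the actual model:

    from tinygrad.tensor import Tensor

    x = Tensor.randn(512, 3, 32, 32)   # hypothetical batch of 32x32 RGB images
    w = Tensor.randn(64, 3, 3, 3)      # hypothetical 64 output channels, 3x3 filters
    # tinygrad is lazy: chaining these ops builds one graph, and realize()
    # is where the scheduler gets its shot at fusing them into fewer kernels.
    out = x.conv2d(w, padding=1).relu().max_pool2d(kernel_size=(2, 2))
    out.realize()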
https://github.com/tinygrad/tinygrad/blob/master/examples/hl...
The goal is to beat an A100 in speed on a tinybox.