They do a small study in section 4.1 comparing batchnorm against adaptive gradient clipping (AGC) for ResNets over a range of hyperparameters, and they also compare performance to batchnorm versions in table 6. The results indicate AGC does give a real boost over batchnorm.
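For anyone unfamiliar with AGC, the core idea is just to rescale a gradient whenever its norm gets too large relative to the norm of the weights it updates. Here's a rough sketch of that rule; the function name and default constants are my own, and for simplicity this uses one norm per tensor rather than the unit-wise (per-row) norms the paper actually uses:

```python
import numpy as np

def adaptive_gradient_clip(weight, grad, clipping=0.01, eps=1e-3):
    # Simplified AGC sketch: if ||g|| exceeds clipping * max(||w||, eps),
    # rescale the gradient so its norm equals that threshold.
    # (The paper applies this per output unit; defaults here are illustrative.)
    w_norm = max(np.linalg.norm(weight), eps)
    g_norm = np.linalg.norm(grad)
    max_norm = clipping * w_norm
    if g_norm > max_norm:
        return grad * (max_norm / g_norm)
    return grad

# A gradient much larger than the weights gets scaled down hard:
w = np.ones(4)            # ||w|| = 2
g = np.full(4, 10.0)      # ||g|| = 20, way above 0.01 * 2
clipped = adaptive_gradient_clip(w, g)
```

The appeal over plain gradient-norm clipping is that the threshold adapts to each layer's weight scale instead of being one global constant.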
They do a bunch of manual hyperparameter tuning that seems necessary to get the state-of-the-art results. From my reading it doesn't seem like they actually used NAS, just that the baseline they compare against was found with NAS.