
Is there actually a speed-up? In Table 3 of the paper, if you compare each ALBERT model with the smaller BERT model, you see similar accuracies and longer training times.



They are comparing how long it takes to run training for 125K steps, not how long it takes to reach a given accuracy.

In Section 4.8 they compare accuracy after the same amount of training time for the largest configuration of each model, and show that ALBERT is substantially better.
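
To make the distinction concrete, here is a rough sketch of the three ways you could read "speed" off a training log: time to a fixed step count, time to a target accuracy, and accuracy at a fixed time budget. The logs, the model names, and the helper functions are all made up for illustration; none of the numbers come from the paper.

```python
from typing import List, Optional, Tuple

# Each entry is (step, elapsed_hours, dev_accuracy).
# All values are invented for illustration, NOT taken from the ALBERT paper.
Log = List[Tuple[int, float, float]]

small_model: Log = [
    (25_000, 2.0, 0.80),
    (50_000, 4.0, 0.84),
    (100_000, 8.0, 0.86),
    (125_000, 10.0, 0.865),
]
large_model: Log = [
    (25_000, 6.0, 0.83),
    (50_000, 12.0, 0.87),
    (100_000, 24.0, 0.90),
    (125_000, 30.0, 0.905),
]


def hours_to_step(log: Log, step: int) -> Optional[float]:
    """Wall-clock hours until the first checkpoint at or past `step`."""
    for s, hours, _ in log:
        if s >= step:
            return hours
    return None  # the run never got that far


def hours_to_accuracy(log: Log, target: float) -> Optional[float]:
    """Wall-clock hours until the first checkpoint at or above `target`."""
    for _, hours, acc in log:
        if acc >= target:
            return hours
    return None  # the run never reached the target


def accuracy_at_hours(log: Log, budget: float) -> Optional[float]:
    """Accuracy of the latest checkpoint finished within `budget` hours."""
    latest = None
    for _, hours, acc in log:
        if hours <= budget:
            latest = acc
    return latest


if __name__ == "__main__":
    for name, log in [("small", small_model), ("large", large_model)]:
        print(
            name,
            hours_to_step(log, 125_000),
            hours_to_accuracy(log, 0.86),
            accuracy_at_hours(log, 10.0),
        )
```

With these invented logs the three metrics give three different answers. The small model reaches 125K steps sooner and also hits 0.86 sooner, and it is ahead at a 10-hour budget, but only the large model ever reaches 0.90. Which model looks "faster" depends on which question you ask, and that is the disagreement in this thread.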



