
This paper presents some evidence to the contrary: https://arxiv.org/abs/2403.17887



Good reference. I work on this stuff day-to-day, which is why I feel qualified to comment on it, though mostly on images rather than natural language. In my defense, work like this is exactly why I included a little disclaimer. It's well known that plenty of popular models quantize/prune/sparsify well for some tasks. The authors' suggestion that "current pretraining methods are not properly leveraging the parameters in the deeper layers of the network" is what I was referring to when I said the networks are not "at capacity".
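
To make that concrete, here is a rough sketch of the kind of magnitude pruning I mean, applied only to the deeper half of a toy network. This is illustrative, not the paper's layer-dropping method; the model size, pruning ratio, and "deeper half" split are all assumptions for the example.

    # Toy sketch: prune the deeper half of a small MLP and measure sparsity.
    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Stand-in for a larger pretrained model.
    model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)])

    # Zero out 50% of the smallest-magnitude weights in the deeper layers.
    for layer in list(model)[4:]:
        prune.l1_unstructured(layer, name="weight", amount=0.5)
        prune.remove(layer, "weight")  # bake the mask into the weights

    # Report the overall fraction of weights that are now exactly zero.
    total = sum(p.numel() for p in model.parameters())
    zeros = sum((p == 0).sum().item() for p in model.parameters())
    print(f"sparsity: {zeros / total:.2%}")

In practice you'd evaluate (and usually fine-tune) after pruning; the point is just that a large fraction of deep-layer weights can often be zeroed with little damage, which is what "not at capacity" is getting at.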



