Paper code is probably optimized for "first to publish". It's also overfit in a non-traditional sense, since people want to beat SOTA by as much as possible. Also, the heuristic tips and tricks and the autotuning you'd want in a production model would exceed the paper length 10x. And the author is motivated NOT to provide a bug-free, easy-to-put-in-production version of the code, since that would lower the $ value of their expertise. A cocktail of all the wrong incentives!
Probably the production versions of those models are suboptimal in different ways but work better in practice...