While this is true, open models can be tested on tasks created after the model was released, so those tasks cannot have been in its training data. We see good numbers there as well, so something is generalising.
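As a rough illustration of the idea, here is a minimal sketch of scoring a model only on tasks dated after its release. Everything in it is an assumption for illustration: the `TaskResult` record, the `post_release_accuracy` helper, and the dates and outcomes are made up, not taken from any real benchmark.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TaskResult:
    # Hypothetical record: one benchmark task and whether the model solved it.
    task_id: str
    created: date
    solved: bool


def post_release_accuracy(results, release_date):
    """Accuracy over tasks created strictly after the model's release date."""
    fresh = [r for r in results if r.created > release_date]
    if not fresh:
        return None  # no post-release tasks to score
    return sum(r.solved for r in fresh) / len(fresh)


# Made-up data for illustration only.
results = [
    TaskResult("a", date(2023, 1, 10), True),
    TaskResult("b", date(2024, 3, 5), True),
    TaskResult("c", date(2024, 6, 20), False),
]
score = post_release_accuracy(results, release_date=date(2023, 12, 1))
print(score)
```

Task "a" predates the assumed release date, so it is dropped and only "b" and "c" count toward the score. Comparing this number with the score on older tasks gives a rough sense of how much of the headline result could be memorisation.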