Most benchmarks contain many examples of a single task - e.g., each example in an image classification benchmark is an image paired with its class label. The recipe for doing well on these types of benchmarks has historically been to (1) train a large model on (2) lots of data.
However, each item in the ARC benchmark is a totally unique task. The model is presented with a handful of worked examples (inputs and their answers) of that task and is then asked to complete one fresh instance of it.
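To make the setup concrete, here is a minimal sketch of what a single ARC task looks like, following the JSON layout of the public dataset; the grid contents and the `solve` stub are illustrative, not drawn from a real task:

```python
import json

# A single ARC task in the layout used by the public dataset
# (https://github.com/fchollet/ARC). Each grid is a 2-D list of
# integers 0-9, one integer per color. Contents here are made up.
task = {
    "train": [  # the handful of worked examples
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        # The instance to complete. In the public files the output is
        # included; on the hidden evaluation set it must be predicted.
        {"input": [[3, 0], [0, 3]]}
    ],
}

def solve(task: dict) -> list:
    """Hypothetical solver: infer the transformation from the train
    pairs, then apply it to each test input grid."""
    raise NotImplementedError

print(json.dumps(task, indent=2))
```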
Importantly, the evaluation tasks are kept secret. The only way models can “prepare” for ARC is by getting familiar with the public priors shared across ARC tasks - e.g., the colored grid world.
As a result, ARC evaluates a model's ability to learn new tasks from limited data at test time. This is something humans do very well and models, at least until now, have not.