The calendar one is weird to me because the sample questions can just be calculated if you know which year they are asking about. And the popular LLMs are pretty reliable at stuff like "what's the 153rd day of the year 2025" or "what day of the week was New Year's Day in 1953", even without being multimodal.
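For reference, here's a rough sketch of how mechanical those two sample questions are, using only Python's standard `datetime` module. The helper names `nth_day_of_year` and `weekday_name` are mine, made up for illustration, and aren't taken from the benchmark.

```python
from datetime import date, timedelta

# Fixed English names so the result does not depend on the system locale.
# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def nth_day_of_year(year: int, n: int) -> date:
    """Return the calendar date of the n-th day (1-based) of `year`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = date(year, 1, 1) + timedelta(days=n - 1)
    if result.year != year:
        raise ValueError(f"{year} does not have {n} days")
    return result


def weekday_name(d: date) -> str:
    """Return the English weekday name for a date."""
    return WEEKDAYS[d.weekday()]


d = nth_day_of_year(2025, 153)
print(d.isoformat(), weekday_name(d))      # 2025-06-02 Monday
print(weekday_name(date(1953, 1, 1)))      # Thursday
```

Once the year is known, there's nothing left to look up on a calendar image: leap years are the only thing that changes the answer, and `datetime` handles that already.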