I can see where that'd help (significantly), but I don't see why a captcha-reader couldn't also parse the key words and interpret the image accordingly. Size superlatives ("largest", "smallest") and words like "shape" / "color" are easy to rip out, and an OCR-style pass that detects shapes, sizes and a number isn't hard to build.
And remember that the site lists Captchas as being OCR-able, and that the question sentence must be readable - quickly - by humans, so it can only be weakly obfuscated, unlike some of the shorter Captchas. That weakens an already-claimed-as-weak system, so it's easy to assume the instructions can be read. Detecting a couple key words via a smallish dictionary seems simple at worst.
A few mistakes can be ignored, because the server must allow a few - people will make them too. After all of the above, what's the success rate on a dumb interpreter? Say, one that can't understand order of words, merely their existence.
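To make the "dumb interpreter" concrete, here's a rough Python sketch of what I have in mind. It assumes the hard parts are already done: the instruction sentence has been OCR'd into text, and the objects in the image have been detected and labelled. The `Obj` type, the word lists and `guess_target` are all made up for illustration, not taken from any real solver.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    """One detected object (hypothetical output of a shape detector)."""
    shape: str
    color: str
    size: float
    x: int
    y: int


# Smallish dictionary of key words; purely illustrative.
SHAPES = {"circle", "square", "triangle"}
COLORS = {"red", "green", "blue"}
LARGEST = {"largest", "biggest"}
SMALLEST = {"smallest", "tiniest"}


def guess_target(instruction, objects):
    """Pick an object using only which key words are present, not their order."""
    if not objects:
        return None
    # Bag of words: lowercase, strip punctuation, discard ordering.
    words = {w.strip(".,!?\"'").lower() for w in instruction.split()}
    shapes = words & SHAPES
    colors = words & COLORS
    # Keep objects matching any mentioned shape and any mentioned color.
    candidates = [
        o for o in objects
        if (not shapes or o.shape in shapes)
        and (not colors or o.color in colors)
    ]
    if not candidates:
        candidates = list(objects)  # nothing matched, fall back to everything
    if words & LARGEST:
        return max(candidates, key=lambda o: o.size)
    if words & SMALLEST:
        return min(candidates, key=lambda o: o.size)
    return candidates[0]  # no size word found, so just take the first match
```

Given "Click the largest red circle." this picks the biggest red circle and returns its coordinates to click. It does get fooled when order matters: "Click the circle next to the largest square" makes it pick the largest object of either shape, which is the kind of mistake I'm asking about when I ask for the success rate.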
I'm not seeking to totally shoot this down - I think it's an interesting idea, and with a large number of objects and some visual obfuscation it could possibly supersede Captcha. But it works by adding logical complexity; to retain ease of use, visual complexity (or some other form of complexity) has to be sacrificed, which makes it easier for machines to read. Or, potentially, the image could be made quite a bit bigger, so the images / objects used can be significantly more complex ("click the dog" when given several images).