Why do LLMs (or rather, the similar models that generate images) keep getting the number of fingers on a human hand wrong, or show two people's arms or legs merging into each other? And why do computer-generated videos fail at object permanence, with objects changing or vanishing from one frame to the next? It seems to me that these models do not have a model of the world, only an imperfect model of the pictures they've seen.