This is harder than it appears. "House": [h] [ɒ] [ʊ] [s] [iː] (or whatever pronunciation they teach in phonics) ... [hɒʊsiː] isn't an English word. Or maybe it is, but you don't know it yet, because you're just beginning to learn to read and there are many words you don't know. But probably it's some word that sounds similar. "Horsie" [hɔɹsi]? Pretty close, that's probably it.
Of course there are additional rules that can help disambiguate (e.g. a final "e" is frequently silent), but a beginner isn't going to know them all. So telling them to check whether the word they came up with fits the context, and to use pictures to help with error correction, isn't terribly wrong. Having the teacher intervene when the kid misreads something may be better, but it requires the teacher to be present in the first place.
Where the three-cue method fails seems to be in the order of presentation. When they are shown the picture first, the kids learn to guess the text without reading at all. If the book is full of sentences like "Look at the X." next to a picture of X, you don't need to be able to read to figure out what the text next to the caterpillar is going to say. According to my flawed understanding of cognition, this is going to condition the kids to think that the picture is a more reliable predictor of what they have to read out loud than the letters on the page, so they're going to focus their attention on the picture.
If, on the other hand, the picture were on the next page, you could still use it to confirm you read correctly, while guaranteeing that the predictor-predicted relationship doesn't draw attention to the wrong place.
I wasn't suggesting that "house" vs "horse" is a particularly difficult distinction, but rather that just looking at the letters is not enough. Kids really do need those additional rules, and until they've learned them, some kind of error correction is necessary.