Couldn't it encode the normal map in the RGB channels of a single extra image, instead of 4 extra images?

That's what it does; the grayscale images are what it uses to produce the normal map.

I guess with the 4 source images it's easier to see what's going on and predict the output.

