You've pretty much nailed it, yes. That, and because it doesn't hash the level of the child hashes internally, you can construct a file which pretends to be upper hashes. That is potentially not just collidable but actually second-preimageable, given what we saw with the much older MD4-based ones. And they used SHA-1, which wasn't a great idea either! (Although, it should be noted, this was in 2009. Could a mod mark the headline as such?)
The file size being there does complicate an attack - but with the weaknesses in SHA-1, I certainly wouldn't feel comfortable with it.
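To make the "file which pretends to be upper hashes" trick concrete, here is a minimal sketch on a toy tree, not the spec under discussion: it leaves out the file size entirely, and the block size and helper names are made up for the demo. It also shows how prefixing leaves and internal nodes with different tag bytes (the approach I believe TTH takes) stops the forgery.

```python
import hashlib

BLOCK = 1024  # hypothetical leaf size, chosen only for this demo


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def merkle_root(data: bytes, leaf_tag: bytes = b"", node_tag: bytes = b"") -> bytes:
    """Toy binary hash tree. With empty tags, leaves and nodes are hashed identically."""
    level = [sha1(leaf_tag + data[i:i + BLOCK])
             for i in range(0, max(len(data), 1), BLOCK)]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            # An odd node at the end of a level is promoted unchanged.
            nxt.append(sha1(node_tag + pair[0] + pair[1]) if len(pair) == 2 else pair[0])
        level = nxt
    return level[0]


original = bytes(range(256)) * 8  # 2048 bytes, i.e. two leaves

# Untagged tree: a 40-byte "file" made of the two leaf hashes has the same root.
forged = sha1(original[:BLOCK]) + sha1(original[BLOCK:])
naive_collides = merkle_root(original) == merkle_root(forged)

# Tagged tree: the same trick fails, because a leaf can no longer pass for a node.
forged_tagged = sha1(b"\x00" + original[:BLOCK]) + sha1(b"\x00" + original[BLOCK:])
tagged_collides = (merkle_root(original, b"\x00", b"\x01")
                   == merkle_root(forged_tagged, b"\x00", b"\x01"))

print(naive_collides, tagged_collides)
```

Note the forgery needs no weakness in SHA-1 at all; it is purely structural. Mixing the file size into the root would block this exact forgery, since the two files differ in length, which is why it complicates things.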
This is a disaster of a spec. We already had TTH at this point, and that at least did it better. This spec needed revising and should not be implemented by anyone.
Today, you should consider using BLAKE2b's tree hash for this purpose. It walks all over this construct from every direction.
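For anyone who wants to try it, Python's `hashlib.blake2b` exposes the tree-mode parameters directly. This is a rough sketch of a two-leaf tree modelled on the example in the hashlib docs; the fanout, depth, leaf size, and the `tree_root` helper are my own choices for illustration, and it only handles inputs that span exactly two leaves.

```python
from hashlib import blake2b

FANOUT = 2
DEPTH = 2
LEAF_SIZE = 4096
INNER_SIZE = 64  # digest length of non-root nodes


def tree_root(buf: bytes) -> str:
    """Hash a buffer of LEAF_SIZE+1 .. 2*LEAF_SIZE bytes as a two-leaf BLAKE2b tree."""
    common = dict(fanout=FANOUT, depth=DEPTH,
                  leaf_size=LEAF_SIZE, inner_size=INNER_SIZE)

    # Leaves: node_depth=0, with node_offset giving the position within the level.
    h00 = blake2b(buf[:LEAF_SIZE], node_offset=0, node_depth=0,
                  last_node=False, **common)
    h01 = blake2b(buf[LEAF_SIZE:], node_offset=1, node_depth=0,
                  last_node=True, **common)

    # Root: node_depth=1, fed with the two leaf digests.
    root = blake2b(digest_size=32, node_offset=0, node_depth=1,
                   last_node=True, **common)
    root.update(h00.digest())
    root.update(h01.digest())
    return root.hexdigest()


print(tree_root(bytes(6000)))
```

The point is that the node depth and offset go into each node's parameter block, so a leaf can never be mistaken for an internal node, which is exactly what the construct above gets wrong.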
I do really like the BLAKE2b hash, but I've been concerned about actually using it in practice (although recently I had an application which it would have suited very well).
I'm worried that, with its predecessor BLAKE having failed to win the SHA-3 contest, it will end up relegated to obscurity, and using obscure hash functions isn't usually a great idea.
Is this a valid concern, or am I placing too much weight on the NIST process?