this is incredible but I can't help but feel like we're skipping some important ...

this is incredible but I can't help but feel like we're skipping some important steps by working with plain text and bitmaps - "a collection of glasses sitting on a table" sometimes eyeglasses sometimes drinking glasses sometimes a weird amalgamation. and as long as we're OK with ambiguity in every layer are we really ever going to be able to meaningfully interrogate the reasons behind a particular output?