I agree that learning to reason about the world likely does not require experience with motor control and proprioception (i.e. literally how babies do it), though I do think you need at least some sort of tempo-spatial experience (e.g. visual). Tempo-spatial representations are just extremely hard to convey through text alone. You might get the idea of closeness by saying 'close is when two words are close in a sequence of words' and of ordering by saying 'this word comes after that word', but I think it would be very difficult to extrapolate those concepts to more than one dimension, and to dimensions that have not just an ordering but also a metric (just think about our inability to reason about even a fourth dimension). You need rich representations of our 3+1 dimensional world to be able to reason about it, and text only gives you perhaps "0.5" dimensions (because it lacks a metric, i.e. it does not convey durations in terms of the ticks of the recurrent network).

But I also doubt that interaction with the world is necessary. In fact, I think it would be rather easy for an AI with unrestricted and noiseless memory to simply write motor programs in a programming language, once it has learned to reason about tempo-spatial patterns just from observing them and identifying them with our language-coded shared concept space. It is not constrained to real-time performance of actions as humans are, so it can take the much easier route of programming any interaction with the world as needed, on the fly. Once rudimentary tempo-spatial representations are in place, our shared concept space likely conveys enough of our general (common sense) knowledge of how these patterns interact and evolve over time.

tl;dr I think you need a small set of tempo-spatially grounded meanings (though not necessarily agent-related) and you can bootstrap everything from that using only textual knowledge.

