1. Sample actions from a random policy distribution.
2. Fit an inverse model to this data with supervised learning. An inverse model maps the current observation and the next observation to the action that produced the transition: f(s_t, s_{t+1}) -> a_t
3. Use reinforcement learning to fit a policy that steers the next observation toward a goal: p(s_t) -> s_{t+1}
4. Collect new data from attempts in which the policy and inverse model work together, and use it to continue training the inverse model.
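The loop above can be sketched in a toy 1-D setting. Everything here is a hypothetical stand-in: the environment is `s_{t+1} = s_t + a`, the "inverse model" is a single learned coefficient rather than a neural network, and the "policy" is a hand-coded greedy step toward the goal rather than one trained with RL.

```python
import random

# Toy 1-D environment (hypothetical): the action nudges the state.
def step(s, a):
    return s + a

# 1. Motor babbling: random actions produce (s_t, s_{t+1}, a_t) triples.
def babble(n=1000):
    data, s = [], 0.0
    for _ in range(n):
        a = random.uniform(-1, 1)
        s_next = step(s, a)
        data.append((s, s_next, a))
        s = s_next
    return data

# 2. Fit the inverse model f(s_t, s_{t+1}) -> a_t by least squares.
# The toy dynamics are linear, so one coefficient on (s_{t+1} - s_t)
# suffices; a real system would fit a neural network here.
def fit_inverse(data):
    num = sum((sn - s) * a for s, sn, a in data)
    den = sum((sn - s) ** 2 for s, sn, a in data)
    w = num / den
    return lambda s, sn: w * (sn - s)

# 3. A greedy stand-in for the policy p(s_t) -> s_{t+1}: propose a next
# observation a bounded step closer to the goal.
def policy(s, goal, step_size=0.5):
    return s + max(-step_size, min(step_size, goal - s))

# 4. Run policy + inverse model together, collecting fresh transitions
# near the goal that can be fed back into inverse-model training.
def rollout(inverse, goal, steps=20):
    s, new_data = 0.0, []
    for _ in range(steps):
        target = policy(s, goal)
        a = inverse(s, target)  # inverse model turns the target into an action
        s_next = step(s, a)
        new_data.append((s, s_next, a))
        s = s_next
    return s, new_data

random.seed(0)
inverse = fit_inverse(babble())
final_state, extra = rollout(inverse, goal=5.0)
print(round(final_state, 2))  # ends at the goal, since the toy dynamics are exactly learnable
```

Note where the new data in step 4 comes from: it is concentrated along the path the policy actually takes toward the goal, which is exactly the biasing effect described below.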
Motor babbling is a quick way of generating data, but it isn't particularly efficient. The problem with taking random actions is that most of your data covers parts of the state space that aren't important for the task. Adding the policy biases future attempts toward more useful regions of the state space, yielding better data for continued inverse-model training.
For more examples of these ideas, this paper also combines forward and inverse models to improve sample efficiency.
No programming necessary: throw in a range of settings for bone sizes, lengths, and weights; spawn hundreds of dinosaurs firing random muscle movements; and each generation, breed only the ones that happen to walk the farthest. Thousands of generations later, you get descendants that are very good at walking.
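That selection-and-breeding loop is a plain genetic algorithm. Here is a minimal sketch under toy assumptions: each "creature" is a flat parameter vector, and a fake fitness function stands in for the physics simulator (real walking distance would come from simulating the body).

```python
import random

random.seed(1)
GENES = 8  # e.g. bone sizes, lengths, and muscle weights

def random_creature():
    return [random.uniform(-1, 1) for _ in range(GENES)]

# Stand-in fitness: "distance walked" peaks when the parameters hit an
# arbitrary target vector. A real version would run a physics simulation.
TARGET = [0.5] * GENES
def distance_walked(creature):
    return -sum((g - t) ** 2 for g, t in zip(creature, TARGET))

def breed(a, b):
    # Crossover (pick each gene from one parent) plus a small mutation.
    return [random.choice(pair) + random.gauss(0, 0.05) for pair in zip(a, b)]

population = [random_creature() for _ in range(100)]
for generation in range(200):
    population.sort(key=distance_walked, reverse=True)
    survivors = population[:20]  # keep only the best walkers
    population = survivors + [
        breed(random.choice(survivors), random.choice(survivors))
        for _ in range(80)
    ]

best = max(population, key=distance_walked)
print(round(-distance_walked(best), 4))  # near 0 means parameters near the target
```

The whole "program" is just the fitness function and the breed step; the walking behavior is never coded directly, which is the point of the evolved-walker demos.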
And here's a genetic walker you can grow in your own browser: