I wonder if you could get a better result by including other factors in the reward function, like trying to maintain a slight forward lean.
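To make that concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption on my part, not something from your setup: the names `shaped_reward`, `forward_velocity`, and `torso_pitch` are made up, as are the target lean of 0.1 rad, the weight, and the Gaussian falloff.

```python
import math


def shaped_reward(forward_velocity, torso_pitch,
                  target_pitch=0.1, lean_weight=0.5, pitch_sigma=0.1):
    """Forward-progress reward plus a bonus for holding a slight forward lean.

    All names and default values are hypothetical placeholders.
    Angles are in radians, with positive pitch meaning a forward lean.
    """
    # Bonus is 1.0 when the torso sits exactly at the target pitch and
    # decays smoothly (Gaussian falloff) as the pitch moves away from it.
    pitch_error = torso_pitch - target_pitch
    lean_bonus = math.exp(-(pitch_error ** 2) / (2 * pitch_sigma ** 2))

    # The base term rewards forward speed; the lean term is additive, so
    # setting lean_weight to 0 recovers the velocity-only reward.
    return forward_velocity + lean_weight * lean_bonus
```

The weight probably needs tuning: if it is too large, the agent may learn to stand still at the target lean instead of moving forward.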