Yes, RL works well in fields where the answer can be verified to some degree. That's why AlphaGo succeeded, and it should also work for code generation and math.
Your reward function can simply be the distance between the constrained output and the unconstrained output. That way you won't even need synthetic data, just a dataset of prompts to RL against.
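Something like this rough sketch, assuming you use sentence embeddings as the distance measure (the specific encoder model and the cosine-similarity choice are just illustrative, not a prescription):

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence encoder works here; this model name is only an example.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def reward(constrained_output: str, unconstrained_output: str) -> float:
    """Higher reward when the constrained output stays semantically close
    to the unconstrained output for the same prompt."""
    emb = embedder.encode(
        [constrained_output, unconstrained_output], convert_to_tensor=True
    )
    similarity = util.cos_sim(emb[0], emb[1]).item()  # cosine similarity in [-1, 1]
    return similarity  # distance would be 1 - similarity; using similarity directly keeps the sign intuitive
```

You'd plug this into whatever policy-gradient loop you're running; the only data it needs per step is the prompt and the two generations.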
How do you get the "unconstrained output", and how do you evaluate the distance between the two?
An evaluation method that can reliably measure the distance between two sentences is hard to find; the best available option is a closed-source LLM API, even though it's far from ideal. As a result, we end up having to use current LLMs to improve our models.
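For the LLM-as-judge option, the call looks roughly like this (a sketch assuming the OpenAI chat API; the model name, prompt wording, and 0-10 scale are placeholders, not recommendations):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_distance(sentence_a: str, sentence_b: str) -> float:
    """Ask a closed-source LLM to score semantic distance between two sentences."""
    prompt = (
        "Rate how different these two sentences are in meaning, "
        "from 0 (identical) to 10 (unrelated). Reply with a single number.\n"
        f"A: {sentence_a}\nB: {sentence_b}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any judge model works
        messages=[{"role": "user", "content": prompt}],
    )
    return float(resp.choices[0].message.content.strip())
```

The obvious downsides are cost, latency, and the noise in the judge's scores, which is why it's the fallback rather than the first choice.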