Hacker News

We were thinking about doing exactly this. The closest current work is probably the amazing "Learning Formal Mathematics from Intrinsic Motivation" by Poesia et al. (they also use constraints to increase the likelihood of generating correct theorems/proofs during RL):

https://arxiv.org/abs/2407.00695




Yes, RL works well in fields where answers can be verified to some degree. That's why AlphaGo succeeded, and it should also work for code generation and math.


Your reward function can simply be the distance between the constrained output and the unconstrained output; that way you won't even need synthetic data, just a dataset of prompts to RL against.
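To make the idea concrete, here is a minimal sketch of such a reward, using token-level Levenshtein distance as a stand-in metric (the comment doesn't specify a metric; `edit_distance` and `distance_reward` are hypothetical names, and a semantic distance could be swapped in):

```python
# Sketch: reward a constrained generation by how little it deviates
# from the model's unconstrained generation for the same prompt.
# Token-level edit distance is just a simple placeholder metric.

def edit_distance(a: list[str], b: list[str]) -> int:
    """Token-level Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ta != tb)))    # substitution
        prev = cur
    return prev[-1]

def distance_reward(constrained: str, unconstrained: str) -> float:
    """Reward in [0, 1]; 1.0 means the two outputs match exactly."""
    a, b = constrained.split(), unconstrained.split()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

During RL, you would sample both a constrained and an unconstrained completion per prompt and feed `distance_reward` to the policy-gradient update; only the distance metric would need to change for a semantic variant.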


How do you get the "unconstrained output", and how do you evaluate the distance between the two?

An evaluation method that can judge the distance between two sentences is hard to find; the best option is a closed-source LLM API, even though that's far from ideal. As a result, we'd still have to rely on an existing LLM to improve our models.
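For what it's worth, a cheap, dependency-free lexical baseline does exist in the standard library: `difflib.SequenceMatcher` gives a similarity ratio over tokens. It captures surface overlap only, not meaning, which is exactly why people reach for an LLM judge instead (the `lexical_similarity` wrapper below is a hypothetical name, not an established API):

```python
# Lexical (not semantic) similarity between two sentences, in [0, 1].
# SequenceMatcher scores overlap of matching token subsequences, so
# paraphrases with different wording will score low despite same meaning.
import difflib

def lexical_similarity(a: str, b: str) -> float:
    """Similarity ratio over whitespace tokens; 1.0 means identical."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()
```

This kind of proxy can serve as a sanity check or a fallback reward signal, but for genuinely semantic distance you'd need an embedding model or an LLM-as-judge, as the comment above suggests.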



