Show HN: LlamaGym – fine-tune LLM agents with online reinforcement learning (github.com/khoomeik)
239 points by KhoomeiK 11 months ago | 28 comments



I want to make a Discord bot that impersonates all my friends and continues to refine the model as the conversations continue. Basically this [1] post, but with a more modern model and, ideally, reinforcement learning. Seems like this would fit the bill.... Is there anything else that would make this easier?

[1] https://www.izzy.co/blogs/robo-boys.html


You could perhaps adapt the Doppel Bot Slack bot from Modal Labs: https://github.com/modal-labs/doppel-bot


From the title I misunderstood what it does. However, now I'm wondering whether what I thought it was (don't ask me why I thought that) is possible:

I have a PC that can run e.g. Mistral Instruct 7B Q4 inference at around 30 tokens/s.

How expensive (in computation and memory) would it be to also run backpropagation in addition to inference?

I'm aware that models are typically fine-tuned on much more and better data than normal conversations provide, but on the other hand, if I could fine-tune my local model a teeny tiny bit during / after each conversation I have with it anyway, it would after a while be perfectly customized for me.

I'm also aware that this could be problematic for models that are used by multiple users, but my intended use case would be personal use by a single user.


For an idea of what's possible, you might be interested in this story that was just on HN, where they fine-tune a quantized 70B model in 48 GB of VRAM: https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
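
For a concrete feel of the mechanics, here's a minimal single-GPU QLoRA sketch (not the FSDP setup from the article; the model name and hyperparameters are just illustrative): the 4-bit base weights stay frozen and only the small LoRA adapter matrices get gradients.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "mistralai/Mistral-7B-Instruct-v0.2"
    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")

    # Wrap the frozen 4-bit model with trainable LoRA adapters
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
    model.print_trainable_parameters()  # typically well under 1% of all parameters

    # One tiny gradient step on a conversation snippet
    batch = tokenizer("User: hi\nAssistant: hello!", return_tensors="pt").to(model.device)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()  # gradients flow only to the LoRA adapters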


Very expensive.

AFAIK the model can’t be quantized during backprop, so right there you’d need a ton of RAM.

Backprop is faster because it can be parallelized, but IIRC you need to hold an entire copy of the model for each backprop process.
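
As a rough back-of-envelope: a 7B model at Q4 is only ~4 GB of weights for inference, whereas full fine-tuning in mixed precision usually costs on the order of 12-16 bytes per parameter (weights + gradients + Adam optimizer state), i.e. somewhere around 80-110 GB before you even count activations. That gap is why adapter methods like LoRA/QLoRA, which freeze the (possibly quantized) base weights and only train small extra matrices, are the usual route on consumer hardware.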


Actually, there have been attempts to do quantized backprop, but I'm not sure how successful they've been.


Thank you for making this. Simplifying any aspect of RL is always welcome.


Thanks! Yeah, I think RL for LLMs is pretty underexplored beyond the RLHF stuff. Pretty tough to get working, though.


Didn’t DPO supplant RLHF?


Could someone help me understand the kinds of things you can build with this? Is this like RLHF?


Can this be used outside of OpenAI environments? If yes, I think an example would be great!


Gymnasium is now maintained by the Farama Foundation, an open-source consortium, not OpenAI. But most RL environment work for the past 5+ years has been Gym-compliant. The TextWorld example in the repo, for instance, instantiates a Gym-style environment but doesn’t import from Gymnasium (it uses textworld.gym instead).
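
For reference, the contract is small either way; here's the Gymnasium flavor of the loop with a random policy, just to show the reset()/step() shape that any Gym-style environment exposes:

    import gymnasium as gym

    env = gym.make("FrozenLake-v1")
    observation, info = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # an LLM agent would choose the action here
        observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    env.close()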



Thanks for making this! Helps simplify it nicely


When 150 lines of boilerplate can land you on the front page of HN, maybe it is, in fact, the end of programming?


Karpathy’s micrograd [1] is literally 154 lines. Guess programming ended 4 years ago.

[1] https://github.com/karpathy/micrograd


You think autograd is boilerplate?


Carmack's infamous fast inverse square root was only 13 lines. Measuring code by line metrics rather than its contents reflects shallow, questionable comprehension.


If the first line of Carmack's infamous code was "import fast_inverse_square_root" from PyPI, it wouldn't be as impressive.


Let's not be one of those people who measure developer productivity by number of lines


I’m not really sure what your point is. Is it not remarkable that valuable things can be done in 150 lines?


I agree with you. As for the comment above, I wouldn't assume a single, clearly intended "point". Reading it, I got an impression more of concern, even fear. I'm guessing one underlying driver may be a worry that AI is creeping into more and more programming. Which is true.


Interesting project: basically a wrapper around OpenAI Gym-like functionality that can handle open LLMs.


Yup, it does simplify LLM agent inference on Gym environments, but the main technical contribution is reducing your would-be code overhead for online RL.
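
To give a sense of the bookkeeping being absorbed, a hand-rolled version of that online-RL loop looks roughly like this (the method names act / assign_reward / terminate_episode are illustrative, not necessarily LlamaGym's actual API, and the PPO call assumes something like trl's older PPOTrainer.step interface):

    import torch

    class ChatAgent:
        # Collects (query, response, reward) triples per episode, then does one PPO update.
        def __init__(self, model, tokenizer, ppo_trainer):
            self.model, self.tokenizer, self.ppo_trainer = model, tokenizer, ppo_trainer
            self.queries, self.responses, self.rewards = [], [], []

        def act(self, observation: str) -> str:
            query = self.tokenizer(observation, return_tensors="pt").input_ids[0]
            output = self.model.generate(query.unsqueeze(0), max_new_tokens=32)[0]
            response = output[len(query):]
            self.queries.append(query)
            self.responses.append(response)
            self.rewards.append(0.0)
            return self.tokenizer.decode(response, skip_special_tokens=True)

        def assign_reward(self, reward: float):
            self.rewards[-1] += reward

        def terminate_episode(self):
            # One PPO step over the whole episode, then reset the buffers.
            self.ppo_trainer.step(self.queries, self.responses,
                                  [torch.tensor(r) for r in self.rewards])
            self.queries, self.responses, self.rewards = [], [], []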


Thanks for creating this!


llamagym.com for sale


Very interesting!


Simplified the concept. Nicely done!



