Godot-dodo – Finetuning LLaMA on single-language comment:code data pairs (github.com/minosvasilias)
132 points by minosu on April 23, 2023 | 28 comments



This repository presents finetuned LLaMA models that try to address the limited ability of existing language models when it comes to generating code for less popular programming languages.

gpt-3.5-turbo and gpt-4 have proven to be excellent coders, but fall off sharply when asked to generate code for languages other than Python, JavaScript, etc. The godot-dodo approach to addressing this: finetune smaller models on a single one of these languages, using human-created code scraped from MIT-licensed GitHub repositories, with existing GPT models generating an instruction for each code snippet.

This differs from the dataset-generation approach used by projects such as stanford-alpaca or gpt4all in that the output values of the training set remain high-quality human data, while still teaching the same instruction-following behavior. This will likely prove more effective the more obscure the language. In this case, GDScript was used, which is the scripting language for the popular open-source game engine Godot. The same approach, however, can be applied to any other language.
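
In rough terms, the dataset-generation idea looks something like the sketch below. This is a hypothetical simplification using the pre-1.0 openai Python client; the prompt wording, model settings and helper names are illustrative, not the repository's actual script:

  # Hypothetical sketch: keep the human-written GDScript as the training
  # output, and ask an existing GPT model to write the matching instruction.
  import json
  import openai

  def make_pair(code_snippet: str) -> dict:
      # GPT only writes the instruction; the human code stays untouched.
      response = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          messages=[
              {"role": "system", "content": "You write concise instructions for code."},
              {"role": "user", "content": "Write a one-sentence instruction that the "
                                          "following GDScript snippet fulfills:\n\n" + code_snippet},
          ],
      )
      instruction = response["choices"][0]["message"]["content"].strip()
      return {"instruction": instruction, "output": code_snippet}

  snippets = ["func _ready():\n\tprint(\"hello\")"]  # scraped, MIT-licensed code
  print(json.dumps([make_pair(s) for s in snippets], indent=2))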

Performance is promising, with the 7 billion parameter finetune outperforming GPT models in producing syntax that compiles on first try, while being somewhat less capable at following complex instructions.

A comprehensive evaluation comparing all models can be found here: https://github.com/minosvasilias/godot-dodo/tree/main/models


This sounds like one of those bootstrapping-liftoff things. Generating labels has been a big bottleneck, but if we can just find examples and then label them automatically, this could accelerate all sorts of applications.


I'm not sure what MIT licensed code is supposed to do for you. Are you going to cite every repository ingested?


I suppose for the model itself you should indeed do that?

But then maybe not for the actual predictions made by the model, as the MIT license says:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Arguably e.g. a single function is not a substantial portion of a multi-file project—and, usually, even that function itself is not going to be a verbatim copy but adjusted to your use case regarding variable names etc.


Technically you could do that in a big text file...


This is fabulous.

Just want to add that there are efforts to improve training speed, like this: https://github.com/Lightning-AI/lit-llama/issues/62

So the practical cost/dataset size for language finetunes is bound to get better rapidly.

EDIT: And there is also this for JAX finetuning. https://github.com/young-geng/EasyLM/blob/main/docs/llama.md


The performance report doesn't describe the loss reached by each of these fine-tunings, but I wonder if the number of tokens in the instruction dataset was simply not large enough to produce high-quality output.

I can't think of any other reason the 13B parameter model would perform worse than the 7B model. Would love to see a deep dive into the fine-tuning and more details - by epoch if possible - on the output.


I have seen this same phenomenon mentioned on huggingface: a finetuned large model being worse than its smaller variant.


Thanks for sharing. Why is the training dataset that contains instructions and output wrapped by another enclosing prompt? (https://github.com/minosvasilias/godot-dodo/blob/f62b90a4622...)

Why does this even work when the wrapping prompt is absent during inference? Wouldn't the model then work best against an inference prompt that follows the wrapping prompt structure? The desired outcome, however, is a model that just works without the wrapping prompt.

Edit: see the reply from OP; the wrapping prompt is used for inference as well, so this was a misunderstanding on my part.


The wrapping prompt is also used during inference. (https://github.com/minosvasilias/godot-dodo/blob/f62b90a4622...) Prompting like this is useful for instruct-finetunes, and similar prompts are used by other projects like stanford-alpaca.
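
For reference, an Alpaca-style wrapping looks roughly like the sketch below. This is a hypothetical template, not the exact wording used by godot-dodo:

  # Hypothetical Alpaca-style instruct prompt; godot-dodo's exact template may differ.
  PROMPT_TEMPLATE = (
      "Below is an instruction that describes a task. "
      "Write a response that appropriately completes the request.\n\n"
      "### Instruction:\n{instruction}\n\n### Response:\n"
  )

  def build_prompt(instruction: str) -> str:
      # The same wrapping is used at training time (with the target code appended
      # after "### Response:") and at inference time (the model completes it).
      return PROMPT_TEMPLATE.format(instruction=instruction)

  print(build_prompt("Write a GDScript function that prints the node's name."))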


Thanks for the clarification, makes sense now!


This is nice work, and it is great to see the effort taken to show the pipeline so this will work for others.

One further extension might be to fine-tune to specifically encourage behavior for a client like godot-copilot. I bet you could teach it to obey your particular prompt structure (e.g. matching indentation, or inserting code at the right spot without adding it elsewhere). That would really complete the story and make this very usable by everyday people who don't know/care about LLM internals and fine-tunings.


On this page:

https://github.com/minosvasilias/godot-dodo/tree/main/models

It seems that some of the GPT syntax errors occur because the models were trained on Godot 3 code but the tests were conducted against Godot 4, hence error messages like "KinematicBody2D does not exist in 4.x (should be CharacterBody2D)".


Yes, the changes introduced by Godot 4 were a prime motivator for this project.

However, it is not quite as clear cut as OpenAI's models simply being trained on Godot 3.x projects only. Not only do they sometimes produce valid 4.x syntax (gpt-4 more often than 3.5-turbo), indicating there were at least some 4.x projects in the training data, they also hallucinate other invalid syntax, such as Python-specific functionality, or simply non-existent methods.

I do think evaluating against Godot 3.x would increase their scores somewhat, but I have not had time to do so yet.


> For finetuning godot_dodo_4x_60k_llama_13b, eight A100 80GB GPUs were used.

$300k of hardware! I guess that answers my previous comment about Hetzner servers: https://news.ycombinator.com/item?id=35662925


They called out the costs incurred:

  $30 - dataset generation (OpenAI GPT 3.5-turbo)
  $24 - llama 7b fine-tuning (8x A100 80GB instance costs)
  $84 - llama 13b fine-tuning (8x A100 80GB instance costs)


You can customize a model with a "cheap" 3090 (or maybe a 7900 XTX); see here: https://github.com/Lightning-AI/lit-llama#finetune-the-model

What OP did is more intense, but LoRA/Adapter can still give excellent results.

Pure AI cards for mere mortals aren't really a thing yet.
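
For a sense of what that looks like, here is a minimal LoRA sketch using Hugging Face transformers + peft, assuming you already have LLaMA weights locally and a tokenized instruction dataset; the path and hyperparameters are illustrative, not godot-dodo's:

  # Minimal LoRA sketch (illustrative hyperparameters, not godot-dodo's setup).
  import torch
  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  base = "path/to/llama-7b"  # assumed local LLaMA checkpoint
  model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)

  lora_config = LoraConfig(
      r=8, lora_alpha=16, lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],  # adapters on attention projections only
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()  # a few million trainable params vs. ~7B frozen

Training then runs with a standard transformers Trainer over the instruction/output pairs; only the small adapter weights are updated, which (often combined with 8-bit loading) is what makes a 24GB card workable.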


Anyone tried fine tunning on CPU already? I expect it to be much slower, but is it even practical?


Also I bet the pytorch training code is written with CUDA semantics. Maybe a JAX version would work without messing with the code.


Oh no not a chance, finetuning is really compute intense.


A100s cost about $4/hour/GPU to rent. So the total cost depends on the amount of time, but 8x for 24 hours would cost $768.
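
As a quick sanity check on that arithmetic (a toy calculation, nothing from the repo):

  # Toy estimate: hourly rate per GPU * number of GPUs * hours rented.
  def rental_cost(rate_per_gpu_hour: float, gpus: int, hours: float) -> float:
      return rate_per_gpu_hour * gpus * hours

  print(rental_cost(4.0, 8, 24))  # 768.0 -> the $768 figure above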


8x A100 80GB are $12/hour or so on Lambda Cloud. If you get lucky enough to snag capacity.


I think that many people could be interested in sharing costs if they could obtain a LLaMA-based finetuned model better than GPT-4 in their preferred language. So there is an opportunity for someone to create a startup just for that.


There have to be easier ways to share costs of running the same model, no?


Why wouldn't you use the comments in the source code as well as GPT-generated ones?

Edit: just easier to code the training data generation, I imagine.


In the future, could we see models fine-tuned to specialize in each language? Or would a general model outperform them?


Finetunes are historically better than the base model; that's the whole idea of a finetune... but I think you are really asking whether finetunes of public models will be as good as huge SOTA proprietary models and their paid "finetuning" services.

That's hard to say.


I will wait until someone trains a model for my language.



