This repository presents finetuned LLaMA models that aim to address the limited ability of existing language models to generate code for less popular programming languages.
gpt-3.5-turbo and gpt-4 have proven to be excellent coders, but fall off sharply when asked to generate code for languages other than Python, JavaScript, etc.
The godot-dodo approach to addressing this: finetune smaller models on a single one of these languages, using human-written code scraped from MIT-licensed GitHub repositories, with existing GPT models generating an instruction for each code snippet.
This differs from the dataset generation approach used by projects such as stanford-alpaca or gpt4all in that the outputs in the training set remain high-quality, human-written data, while the model still learns the same instruction-following behavior. This will likely prove more effective the more obscure the language. In this case, GDScript was used, the scripting language of the popular open-source game engine Godot, but the same approach can be applied to any other language.
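As a rough illustration of that pipeline, here is a minimal Python sketch of the labeling step, assuming the pre-1.0 openai SDK; the prompt wording, function names, and Alpaca-style JSON layout are assumptions for illustration, not the repository's actual scripts.

```python
# Hypothetical sketch of a godot-dodo-style dataset generation step:
# for each scraped GDScript snippet, ask an existing GPT model to write
# the instruction that would produce it, then store
# (instruction, human-written code) pairs as Alpaca-style JSON.
import json
import openai  # assumes the pre-1.0 openai SDK interface

LABEL_PROMPT = (
    "Write a single, concise instruction that would lead a programmer "
    "to produce the following GDScript code:\n\n{code}"
)

def generate_instruction(code_snippet: str) -> str:
    # The instruction is machine-generated...
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": LABEL_PROMPT.format(code=code_snippet)}],
        temperature=0.2,
    )
    return response["choices"][0]["message"]["content"].strip()

def build_dataset(snippets, out_path="godot_dodo_dataset.json"):
    # ...while the output stays the original human-written code.
    records = [
        {"instruction": generate_instruction(code), "input": "", "output": code}
        for code in snippets
    ]
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)
```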
Performance is promising, with the 7-billion-parameter finetune outperforming the GPT models at producing syntax that compiles on the first try, while being somewhat less capable at following complex instructions.
This sounds like one of those bootstrapping liftoff things. Generating labels had been a big bottleneck, but if we can just find examples and then label them automatically, this could accelerate all sorts of applications.
I suppose for the model itself you should indeed do that (include the license notice)?
But then maybe not for the actual predictions made by the model, as the MIT license says:
> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
Arguably e.g. a single function is not a substantial portion of a multi-file project—and, usually, even that function itself is not going to be a verbatim copy but adjusted to your use case regarding variable names etc.
The performance report doesn't describe the loss reached by each of these finetunes, but I wonder if the number of tokens in the instruction dataset was simply not large enough to produce high-quality output.
I can't think of any other reason the 13B parameter model would perform worse than the 7B model. Would love to see a deep dive into the fine tuning and more details - by epoch if possible - on the output.
Why does this even work when the wrapping prompt is absent during inference? Wouldn't the model then work best against an inference prompt that follows the wrapping prompt structure, when the desired outcome is a model that just works without the wrapping prompt?
Edit: see the reply from OP; the wrapping prompt is used for inference as well, so this was a misunderstanding on my part.
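For reference, a minimal sketch of how such a wrapping prompt is typically applied, assuming an Alpaca-style template (the exact wording of godot-dodo's template may differ): the same wrapper is used both when building training examples and when querying the finetuned model, so the model always sees instructions in the format it was trained on.

```python
# Illustrative Alpaca-style wrapping prompt; the exact text is an assumption,
# not necessarily godot-dodo's template. The identical wrapper is applied at
# training time and at inference time.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(instruction: str) -> str:
    return PROMPT_TEMPLATE.format(instruction=instruction)

# Example inference query: wrap the user's instruction exactly as during training.
print(build_prompt("Write a GDScript function that makes a CharacterBody2D jump."))
```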
This is nice work, and it is great to see the effort taken to show the pipeline so this will work for others.
One further extension might be to fine-tune specifically to encourage behavior for a client like godot-copilot. I bet you could teach it to obey your particular prompt structure (e.g., matching indentation, or inserting code at the right spot without adding it elsewhere). That would really complete the story and make this very usable by everyday people who don't know/care about LLM internals and finetunes.
It seems that some of the GPT syntax errors stem from the models being trained on Godot 3 while the tests were conducted against Godot 4, hence error messages like "KinematicBody2D does not exist in 4.x (should be CharacterBody2D)".
Yes, the changes introduced by Godot 4 were a prime motivator for this project.
However, it is not quite as clear-cut as OpenAI's models simply being trained on Godot 3.x projects only. Not only do they sometimes produce valid 4.x syntax (gpt-4 more often than 3.5-turbo), indicating there were at least some 4.x projects in the training data, but they also hallucinate other invalid syntax, such as Python-specific functionality or simply non-existent methods.
I do think evaluating against Godot 3.x would increase their scores somewhat, but I have not had time to do so yet.
I think many people could be interested in sharing costs if they could obtain a LLaMA-based finetuned model better than GPT-4 in their preferred language. So there is an opportunity for someone to create a startup just for that.
Finetunes are historically better than the base model; that's the whole idea of a finetune... but I think you are really asking whether finetunes of public models will be as good as huge SOTA proprietary models and their paid "finetuning" services.
A comprehensive evaluation comparing all models can be found here: https://github.com/minosvasilias/godot-dodo/tree/main/models