Hello all,
I have been working on benchmarking different LLMs -- both open-source and closed-source.
Repo: https://github.com/georgian-io/LLM-Finetuning-Hub
Specifically, I am comparing their out-of-the-box capabilities (prompting) against their fine-tuned counterparts!
So far, the following models have been benchmarked:
Open-Source:
- Flan-T5-large (780M)
- RedPajama (3B & 7B)
- Falcon 7B
- Llama2 (7B & 13B)
- Mistral 7B
- Mosaic MPT 7B
Closed-Source:
- AI21 Jurassic-2 (Light, Grande, Jumbo)
- Writer Palmyra 30B
- GPT 3.5 Turbo 154B
The following trends have emerged:
- For out-of-the-box zero-shot & few-shot prompting, GPT-3.5-turbo takes the cake! It is likely that being the biggest model of the bunch helps with generalization.
- Even among other closed-source LLMs such as Jurassic-2 and Palmyra, GPT-3.5-turbo wins!
- Open-source models, however, do not fare as well out-of-the-box. Among the open-source models, Mistral-7B achieves the best results!
- When it comes to fine-tuning, things get very competitive! We notice that much smaller models such as Llama2-7B, Mistral-7B, and Falcon-7B are able to compete with the likes of GPT-3.5-turbo (154B) and the Jurassic-2 models.
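For anyone unfamiliar with the zero-shot vs. few-shot distinction mentioned above: the only difference is whether a handful of worked examples are packed into the prompt ahead of the actual query. A minimal sketch (the function name and example strings here are hypothetical, not from the repo):

```python
def build_prompt(instruction, query, examples=None):
    """Assemble a prompt: no examples = zero-shot, with examples = few-shot."""
    parts = [instruction]
    for ex_input, ex_output in (examples or []):
        parts.append(f"Input: {ex_input}\nOutput: {ex_output}")
    # The query goes last, with the output left for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

# Zero-shot: instruction + query only.
zero_shot = build_prompt(
    "Classify the sentiment as positive or negative.",
    "The food was great!")

# Few-shot: same task, but with two labelled examples prepended.
few_shot = build_prompt(
    "Classify the sentiment as positive or negative.",
    "The food was great!",
    examples=[("I loved it.", "positive"),
              ("Terrible service.", "negative")])
```

The closed-source models above tend to handle both formats well out of the box; the smaller open-source models usually need the few-shot variant (or fine-tuning) to close the gap.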
The last point makes me very hopeful that smaller LLMs, when fine-tuned on narrow use cases / data, can give the much larger models tough competition.
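The comparison itself is straightforward once predictions are collected from each model: score the prompted and fine-tuned outputs against the same gold labels. A toy sketch of that step (names and data are hypothetical; the repo has its own evaluation scripts):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the gold labels."""
    assert len(predictions) == len(labels)
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical outputs from one model on the same four test examples,
# before and after fine-tuning.
gold      = ["positive", "negative", "positive", "negative"]
prompted  = ["positive", "positive", "positive", "negative"]  # out-of-the-box
finetuned = ["positive", "negative", "positive", "negative"]  # after fine-tuning

print(accuracy(prompted, gold))   # 0.75
print(accuracy(finetuned, gold))  # 1.0
```

In practice we also track metrics like F1 for imbalanced classes, but the shape of the comparison is the same.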
I am aiming to benchmark more models, including Anthropic Claude 2, Google PaLM 2, Cohere Command, and Inflection Pi. My hunch is that these huge closed-source models will generally perform well out-of-the-box compared to (smaller) open-source models. However, fine-tuning will change the game, with smaller models competing with or even out-competing the larger ones!
Since there are so many LLMs out there, I would love some help in case anyone is interested in contributing :)
https://news.ycombinator.com/showhn.html