
We (at Anyscale) have benchmarked GPT-4 against the Llama-2 suite of models on three tasks: functional representation, SQL generation, and grade-school math question answering.
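
To give a concrete sense of the task format, here is a hypothetical prompt/completion pair for the SQL generation task; the schema and question are invented for illustration, not taken from the benchmark:

    # Hypothetical fine-tuning example for the SQL generation task.
    example = {
        "prompt": (
            "Schema: CREATE TABLE users (id INT, name TEXT, signup_date DATE)\n"
            "Question: How many users signed up in 2023?\n"
            "SQL:"
        ),
        "completion": (
            " SELECT COUNT(*) FROM users"
            " WHERE signup_date BETWEEN '2023-01-01' AND '2023-12-31';"
        ),
    }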

GPT-4 wins by a lot out of the box. However, surprisingly, fine-tuning makes a huge difference and allows the 7B Llama-2 model to outperform GPT-4 on some (but not all) problems.
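
To make "fine-tuning" concrete, here is a minimal sketch of LoRA fine-tuning Llama-2-7B with the Hugging Face transformers + peft stack. The dataset file, hyperparameters, and training stack here are assumptions for illustration, not the setup used in the blog post:

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Wrap the base model with low-rank adapters so only a small
    # fraction of the weights are trained.
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

    # "sql_task.jsonl" is a placeholder for task-specific
    # prompt/completion pairs like the example above.
    dataset = load_dataset("json", data_files="sql_task.jsonl")["train"]
    dataset = dataset.map(
        lambda ex: tokenizer(ex["prompt"] + ex["completion"],
                             truncation=True, max_length=512),
        remove_columns=dataset.column_names)

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="llama2-7b-sql-lora",
                               per_device_train_batch_size=4,
                               num_train_epochs=3, learning_rate=2e-4),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()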

This is really great news for open models, as many applications will be better served by small, fast, cheap fine-tuned models than by a single large, slow, general-purpose model (Llama-2-7B is something like 2% of the size of GPT-4).

GPT-4 continues to outperform even the fine-tuned 70B model on grade-school math question answering, likely a reflection of the data Llama-2 was pretrained on (adding more fine-tuning data does help narrow the gap here).
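
For reference, grade-school math benchmarks of this kind are typically scored by extracting the model's final numeric answer and exact-matching it against the reference; a hedged sketch of that common protocol (the post's exact evaluation may differ):

    import re

    def extract_final_number(text):
        # Pull the last number out of an answer, ignoring thousands separators.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return numbers[-1] if numbers else None

    def is_correct(model_answer, reference):
        return extract_final_number(model_answer) == extract_final_number(reference)

    assert is_correct("So the total is 42.", "#### 42")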

https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehe...
