Hi HN - In collaboration with UWaterloo, we published a new code-focused needle-in-the-haystack benchmark, Bug In The Code Stack (BICS). A quick sketch of the task setup is below the TLDR.
TLDR
- GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based tasks than text-based tasks at long context lengths.
- The hype is real. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models.
- Llama3-70B reached GPT-3.5-Turbo levels (yay open-source!)
- Gemini 1.0 Pro was bad across the board (super surprising)
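For anyone curious how a task like this is constructed, here's a minimal sketch of a BICS-style generator. It assumes the standard needle-in-the-haystack setup (a long stack of filler Python functions, one injected syntactic bug at a target depth, and a prompt asking the model for the buggy line); all names here are illustrative, not the repo's actual API:

```python
# Hypothetical sketch of a BICS-style task generator -- names and structure
# are illustrative, not the actual code from the repo.

FILLER = "def func_{i}(x):\n    return x + {i}\n"

def build_haystack(num_funcs: int, bug_depth: float) -> tuple[str, int]:
    """Return (source_code, 1-based line number of the injected bug)."""
    lines: list[str] = []
    bug_index = int(bug_depth * num_funcs)  # e.g. 0.5 places the bug mid-stack
    buggy_line = -1
    for i in range(num_funcs):
        body = FILLER.format(i=i)
        if i == bug_index:
            # Inject a syntactic bug: drop the colon from the def line.
            body = body.replace(f"def func_{i}(x):", f"def func_{i}(x)")
            buggy_line = len(lines) + 1  # the def line is the buggy one
        lines.extend(body.splitlines())
    return "\n".join(lines), buggy_line

source, answer = build_haystack(num_funcs=200, bug_depth=0.5)
prompt = (
    "The following Python code contains exactly one syntax error. "
    "Reply with only the line number of the error.\n\n" + source
)
# Send `prompt` to the model under test and score its reply against `answer`.
```

In a setup like this, context length is controlled by the number of filler functions and target depth by where the bug lands in the stack, so accuracy can be measured across a grid of (length, depth) pairs.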
Link to repo: https://github.com/HammingHQ/bug-in-the-code-stack
See full results here: https://hamming.ai/blog/bug-in-the-codestack