Hi HN - In collaboration with UWaterloo, we published a new code-focused needle-in-the-haystack benchmark, Bug In The Code Stack (BICS). A quick sketch of the task setup is below the TLDR.
TLDR
- GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based tasks than text-based tasks at long context lengths.
- The hype is real. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models.
- Llama3-70B reached GPT-3.5-Turbo levels (yay open-source!)
- Gemini 1.0 Pro was bad across the board (super surprising)
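For anyone curious how a task like this is constructed, here's a minimal sketch of a BICS-style generator. It assumes the standard needle-in-the-haystack setup (a long stack of filler Python functions, one injected syntactic bug at a target depth, and a prompt asking the model for the buggy line); all names here are illustrative, not the repo's actual API:

```python
# Hypothetical sketch of a BICS-style task generator -- names and structure
# are illustrative, not the actual code from the repo.

FILLER = "def func_{i}(x):\n    return x + {i}\n"

def build_haystack(num_funcs: int, bug_depth: float) -> tuple[str, int]:
    """Return (source_code, 1-based line number of the injected bug)."""
    lines: list[str] = []
    bug_index = int(bug_depth * num_funcs)  # e.g. 0.5 places the bug mid-stack
    buggy_line = -1
    for i in range(num_funcs):
        body = FILLER.format(i=i)
        if i == bug_index:
            # Inject a syntactic bug: drop the colon from the def line.
            body = body.replace(f"def func_{i}(x):", f"def func_{i}(x)")
            buggy_line = len(lines) + 1  # the def line is the buggy one
        lines.extend(body.splitlines())
    return "\n".join(lines), buggy_line

source, answer = build_haystack(num_funcs=200, bug_depth=0.5)
prompt = (
    "The following Python code contains exactly one syntax error. "
    "Reply with only the line number of the error.\n\n" + source
)
# Send `prompt` to the model under test and score its reply against `answer`.
```

In a setup like this, context length is controlled by the number of filler functions and target depth by where the bug lands in the stack, so accuracy can be measured across a grid of (length, depth) pairs.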
Link to repo: https://github.com/HammingHQ/bug-in-the-code-stack
See full results here: https://hamming.ai/blog/bug-in-the-codestack