New code-focused LLM needle in the haystack benchmark (github.com/hamminghq)
6 points by sumanyusharma 5 months ago | 1 comment



Hi HN - In collaboration with the University of Waterloo, we published a new code-focused needle-in-the-haystack benchmark.

TLDR:

- GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based retrieval tasks than text-based ones at long context lengths.

- The hype is real: GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models.

- Llama3-70B reached GPT-3.5-Turbo levels (yay open source!)

- Gemini 1.0 Pro was bad across the board (super surprising).

Link to repo: https://github.com/HammingHQ/bug-in-the-code-stack

See full results here: https://hamming.ai/blog/bug-in-the-codestack
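For readers unfamiliar with the task format: a code-focused needle-in-the-haystack benchmark buries a single "needle" (here, a bug) at a controlled depth inside a long stack of filler code, then asks the model to locate it. Here is a minimal illustrative sketch of that idea; the function names, the bug type (a missing closing parenthesis), and the scoring rule are my own assumptions, not the benchmark's actual harness.

```python
# Illustrative sketch of a "bug in the code stack" style task.
# All names and the exact bug/scoring choices are assumptions for illustration.
import random


def make_haystack(n_functions: int, bug_depth: int, seed: int = 0) -> tuple[str, str]:
    """Build a long Python source string of filler functions, planting one
    syntactic bug (a missing closing parenthesis) at the given depth.
    Returns (source_code, name_of_buggy_function)."""
    random.seed(seed)
    functions = []
    buggy_name = f"func_{bug_depth}"
    for i in range(n_functions):
        name = f"func_{i}"
        if i == bug_depth:
            # The "needle": a syntax error the model must locate.
            body = f"def {name}(x):\n    return sum([x, 1, 2]\n"
        else:
            k = random.randint(1, 9)
            body = f"def {name}(x):\n    return x + {k}\n"
        functions.append(body)
    return "\n".join(functions), buggy_name


def score(response: str, needle: str) -> bool:
    # A response is scored correct if it names the buggy function.
    return needle in response


source, needle = make_haystack(n_functions=50, bug_depth=37)
prompt = (
    "The following Python code contains exactly one syntax error.\n"
    "Name the function that contains it.\n\n" + source
)
```

Context length is controlled by n_functions and the needle position by bug_depth, which is how a benchmark like this can sweep both axes independently.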



