I was wondering why Figure 1 showed a HumanEval score of 61.6 for Qwen2.5-Coder-7B, but Table 1 shows a score of 88.4, i. e. better than this new model with a score of 66.5.
The reason is that those are actually two different models (Qwen2.5-Coder-7B-Base with 61.6, Qwen2.5-Coder-7B-Instruct with 88.4).
The reason is that those are actually two different models (Qwen2.5-Coder-7B-Base with 61.6, Qwen2.5-Coder-7B-Instruct with 88.4).