
BLEU Score: Bilingual Evaluation Understudy - keyboardman
https://leimao.github.io/blog/BLEU-Score/
======
Der_Einzige
BLEU and ROUGE scores are almost useless for what they are trying to do
(measure how good translation or summarization systems are). The author shows
a great example of an excellent translation getting a 0 score.
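For a rough sense of why that happens (the sentences below are my own
illustration, not the blog's example): sentence-level BLEU takes a geometric
mean of 1- through 4-gram precisions, so a perfectly reasonable paraphrase
with no higher-order n-gram overlap collapses to roughly zero.

    # Sketch with NLTK (pip install nltk); sentences invented for illustration,
    # not taken from the linked post.
    from nltk.translate.bleu_score import sentence_bleu

    reference = "the cat is sitting on the mat".split()
    hypothesis = "a cat sits on a mat".split()  # acceptable translation, different wording

    # Default weights average 1- to 4-gram log-precisions; with no matching
    # higher-order n-grams the geometric mean collapses, so the score is ~0
    # (NLTK also prints a warning about the zero n-gram counts).
    print(sentence_bleu([reference], hypothesis))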

I tried submitting an NLP paper in which I explicitly laid out my reasons for
not evaluating my system with ROUGE scores, and I learned very quickly that,
despite ROUGE being a terrible metric, the NLP community would rather reject
any paper that doesn't report ROUGE scores than admit that there is an
incredible lack of methods for automatically evaluating summaries or
translations.

How many good translation or summarization ideas are not published or utilized
just because they don't get high BLEU scores? I bet it's a lot of them...

~~~
gok
The point of BLEU (and ROUGE and METEOR) is to correlate with human grading.
It's unusual to find cases where an MT model change increases BLEU but hurts
human ratings.

~~~
gas9S9zw3P9c
It's not so unusual in my experience. I used to do research in this area and
we ran side-by-side human evals with our experiments. Yes, huge jumps in BLEU
typically imply better human ratings, but the variance is big. Many times,
significant jumps in BLEU had no significant effect on human ratings. ROUGE is
even more useless.

You just don't see these things reported in the literature because (1) human
evals are a pain to run and can be expensive, and (2) researchers need to
publish or perish.

------
mintrain
There are many ways of computing BLEU. Researchers may use different
tokenization, byte-pair encoding, or raw-text preprocessing, all of which
affect the final score.

The sacrebleu package standardizes BLEU computation to make comparisons of
model performance easier:
[https://github.com/mjpost/sacrebleu](https://github.com/mjpost/sacrebleu)
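As a rough sketch of the tokenization effect (the sentences are invented;
only the tokenize argument matters here), the Python API makes it easy to see
how the same hypothesis/reference pair gets different numbers:

    # Sketch with sacrebleu (pip install sacrebleu); sentences invented.
    import sacrebleu

    hypotheses = ["The cat sat on the mat ."]    # detached final period
    references = [["The cat sat on the mat."]]   # one list per reference set

    # The default WMT tokenizer ('13a') splits punctuation off, so both sides
    # tokenize identically and BLEU comes out at 100.
    print(sacrebleu.corpus_bleu(hypotheses, references, tokenize="13a").score)

    # With no tokenization, "mat." and "mat ." no longer line up and the score
    # drops, even though the texts are effectively the same.
    print(sacrebleu.corpus_bleu(hypotheses, references, tokenize="none").score)

The command-line tool works the same way (system output piped into sacrebleu
with the reference file), which is what makes the reported scores directly
comparable across papers.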

