
Nubia: A SoTA evaluation metric for text generation - aliabd
https://wl-research.github.io/blog/
======
pheme1
Based on your paper, it seems you didn't compare against other model-based
evaluation methods such as Frechet Embedding Distance (FED). I would certainly
like to see a correlation study between the different evaluation methods.
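A correlation study here could be as simple as comparing each metric's scores against human ratings on the same set of outputs. A minimal sketch with made-up numbers (plain Pearson correlation in pure Python, nothing Nubia-specific; the score lists are hypothetical):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-sentence scores: human ratings vs. two automatic metrics.
human    = [4.5, 2.0, 3.5, 1.0, 5.0]
metric_a = [0.90, 0.40, 0.70, 0.20, 0.95]  # e.g. a learned metric
metric_b = [0.60, 0.55, 0.50, 0.45, 0.65]  # e.g. an n-gram overlap metric

print(pearson(human, metric_a))
print(pearson(human, metric_b))
```

The metric whose scores correlate more strongly with the human column is the one you'd trust more on that domain.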

~~~
aliabd
Thanks for your feedback! While we couldn't compare with every model-based
method out there, YiSi (August 2019) and BERTScore, which was just presented
at ICLR 2020 this week (April 27th 2020), are very strong methods to compare
against and reflect the state of the art. All comparisons are welcome!

------
moinnadeem
What do you think about the train test discrepancy? ie. will practitioners
have to fine-tune Nubia's models on their training dataset in order to
evaluate on their test dataset?

~~~
aliabd
The aggregators in Nubia are pretrained to correlate with human judgment, so
it should only be used for inference. The longer-term idea, though, is that you
could use it as a loss function to optimize translation/image
captioning/summarization models. It's too big for that as is, but that's what
we're working towards.
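To make the inference-only use concrete: you call the metric to score or rank system outputs against a reference, without any fine-tuning. A minimal sketch, with a simple token-overlap F1 as a hypothetical stand-in for Nubia's pretrained scorer:

```python
def overlap_f1(reference: str, candidate: str) -> float:
    """Stand-in scorer: token-overlap F1. A real setup would call a
    pretrained metric (e.g. Nubia's aggregator) here instead."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    common = len(ref & cand)
    if common == 0:
        return 0.0
    p, r = common / len(cand), common / len(ref)
    return 2 * p * r / (p + r)

def best_candidate(reference, candidates, scorer=overlap_f1):
    """Inference-only use: rank system outputs against a reference
    and return the highest-scoring one."""
    return max(candidates, key=lambda c: scorer(reference, c))

ref = "the cat sat on the mat"
outputs = ["a cat is on the mat", "dogs run in the park"]
print(best_candidate(ref, outputs))
```

Swapping `overlap_f1` for a learned scorer changes nothing about this loop; what rules out using it as a training loss today is only the size and cost of the underlying models.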

~~~
mhkane
I think the question here is more along the lines of: "If I now have, say,
radiology reports, do I use Nubia out of the box, or do I need to make it read
radiology reports and develop a sense of what high-quality radiology reports
look like before using it?"

~~~
aliabd
Oh I see, thanks! Will clarify.

------
mhkane
Super cool! If all the components of this metric are models, thoughtful
versioning and documentation will be important. What are your thoughts on
this?

~~~
aliabd
This is actually something we struggled with, and it's why we released
everything this way. There is something to be said about why BLEU and ROUGE
have stood the test of time: they are extremely simple to use and static.
Hopefully we can bridge this gap.

