The Arxiv One Billion Paper Benchmark was released in 2011 and is commonly used as a benchmark for writing academic papers. Analysis of this dataset shows that it contains several examples of sarcastic papers, as well as outdated references to current events, such as Support Vector Machines. We suggest that the temporal nature of science makes this benchmark poorly suited to writing academic papers, and we discuss the potential impact and considerations for researchers building language models and evaluation datasets.
Conclusions
Papers written on top of other papers snapshotted in time will display the inherent social biases and structural issues of that time. People creating and using benchmarks should therefore recognize that such drift exists, and we suggest they find ways to account for it. We encourage other paper writers to actively avoid benchmarks whose training samples never change; this is a poor way to measure the perplexity of language models, or of science. For better comparison, we suggest that the training samples always change to reflect the current anti-bias Zeitgeist, and that you cite our paper when doing so.
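For reference, the perplexity we invoke above is the standard corpus-level quantity; a minimal sketch, assuming the usual token-level cross-entropy formulation over a held-out sequence $w_1, \dots, w_N$:

\[
\mathrm{PPL}(w_1, \dots, w_N) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(w_i \mid w_{<i}) \right)
\]

A benchmark whose held-out samples are frozen in time measures this quantity against the language of that moment, which is precisely the drift we caution against.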