
New ERNIE language representation tops BERT in Chinese NLP tasks - rococode
https://medium.com/syncedreview/baidus-ernie-tops-google-s-bert-in-chinese-nlp-tasks-d6a42b49223d
======
rococode
Source code:
[https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE](https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE)

It seems that the primary difference here is that ERNIE generates a different
set of data for the masked LM task that BERT trains on. Rather than masking
tokens arbitrarily, it does some preprocessing with a tagging tool to identify
multi-character segments (e.g. entities and phrases) that are then masked as
whole units (my Chinese is rusty so this may not be totally accurate).
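
Roughly, the masking step might differ like this. A minimal sketch of my reading, not ERNIE's actual pipeline; the `(start, end)` segment representation is my own assumption about what a tagger might output:

```python
import random

MASK = "[MASK]"

def random_token_masking(tokens, mask_prob=0.15):
    # BERT-style: each token is masked independently at random.
    return [MASK if random.random() < mask_prob else t for t in tokens]

def segment_masking(tokens, segments, mask_prob=0.15):
    # ERNIE-style (as I understand it): contiguous multi-token segments
    # found by a tagging tool are masked as whole units, so the model
    # must predict the entire entity/phrase from the outside context.
    # `segments` is a hypothetical list of (start, end) index pairs.
    out = list(tokens)
    for start, end in segments:
        if random.random() < mask_prob:
            for i in range(start, end):
                out[i] = MASK
    return out

# "Harbin is the capital of Heilongjiang"; the two place names are
# multi-character entities that only make sense as units.
tokens = ["哈", "尔", "滨", "是", "黑", "龙", "江", "的", "省", "会"]
segments = [(0, 3), (4, 7)]
print(segment_masking(tokens, segments, mask_prob=1.0))
```

With random masking you might blank out just "尔" and the model can often guess it from the neighboring characters alone; masking the whole entity forces it to use the rest of the sentence.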

I believe the intuition here is that BERT somewhat expects tokens to be
relatively distinct units of meaning, since it masks them individually, but
this assumption doesn't hold as well for Chinese, where individual characters
are frequently grouped together to form a single unit of meaning. I feel this
could apply to English too, to a lesser extent; curious whether anyone has
tried something similar (see the sketch below).
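
For English, the closest analogue I can think of is masking all the subword
pieces of a word together. A rough sketch, assuming WordPiece-style "##"
continuation markers; the function name and inputs are illustrative, not any
library's API:

```python
import random

MASK = "[MASK]"

def whole_word_masking(subwords, mask_prob=0.15):
    # Treat a word split into subwords ("un", "##believ", "##able") as
    # one unit and mask all of its pieces together, instead of masking
    # each subword independently.
    out = list(subwords)
    i = 0
    while i < len(subwords):
        # A word spans a subword plus any following "##" continuations.
        j = i + 1
        while j < len(subwords) and subwords[j].startswith("##"):
            j += 1
        if random.random() < mask_prob:
            for k in range(i, j):
                out[k] = MASK
        i = j
    return out

print(whole_word_masking(["un", "##believ", "##able", "plot", "twist"],
                         mask_prob=1.0))
```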

