
Code Walkthrough of Bert with PyTorch for a Multilabel Classification in NLP - prabhatjha
https://engineering.wootric.com/when-bert-meets-pytorch
======
rsmith49
For fine tuning BERT onto a specific domain, what amount of text data would
you recommend to train on?

~~~
yashvijay
Since the pretraining is done on the tasks of masked word prediction and next
sentence prediction (i.e. whether two sentences are contiguous), I'd suggest
about a million sentences from the same domain, with an average length of
around 7 tokens per sentence. Longer sentences would definitely help: BERT's
transformer encoder applies multi-head self-attention over the whole sequence,
so longer contexts let the model learn better contextual representations for
the embedding layer.
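As a rough sketch of what "continuing pretraining on your domain" looks like in
practice, here is a minimal example using the Hugging Face transformers library
(not from the article; the file name, model checkpoint, and hyperparameters are
placeholders, and it only covers the masked-word objective, not next sentence
prediction):

    from datasets import load_dataset
    from transformers import (
        BertTokenizerFast,
        BertForMaskedLM,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    # One domain sentence per line in a plain-text file (hypothetical path).
    dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=128)

    tokenized = dataset["train"].map(
        tokenize, batched=True, remove_columns=["text"]
    )

    # Randomly masks 15% of tokens, as in BERT's original pretraining setup.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm_probability=0.15
    )

    args = TrainingArguments(
        output_dir="bert-domain-adapted",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    )

    Trainer(
        model=model,
        args=args,
        train_dataset=tokenized,
        data_collator=collator,
    ).train()

After this step you'd load the adapted checkpoint and fine-tune it on the
downstream multilabel classification task as in the article.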

