LOL, I was reading the abstract and remembering there used to be a paper like that. Then I looked at the title and saw it was from 2020. For a moment I thought someone had plagiarised the original paper.
Unfortunately BERT models are dead. Even the cross between BERT and GPT - the T5 architecture (encoder-decoder) - is rarely used.
The issue with BERT is that you need to modify the network to adapt it to any task by adding a prediction head, while decoder models (GPT style) do every task with tokens and never need to modify the network. Their advantage is that they have a single format for everything. BERT's advantage is bidirectional attention, but apparently large decoder models don't have an issue with unidirectionality.
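Roughly, the contrast looks like this. A minimal sketch using Hugging Face transformers; the model names and the toy sentiment task are just illustrative:

```python
from transformers import BertForSequenceClassification, BertTokenizerFast

# BERT route: the base encoder alone can't classify; a task-specific prediction
# head (here a 2-way linear layer on top of [CLS]) is bolted on and has to be
# fine-tuned separately for each new task.
tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
inputs = tok("the movie was great", return_tensors="pt")
logits = bert(**inputs).logits  # shape (1, 2): one score per label

# Decoder (GPT-style) route: no architectural change at all; the task is
# expressed entirely in tokens and the answer is just more tokens.
prompt = (
    "Classify the sentiment as positive or negative.\n"
    "Review: the movie was great\n"
    "Sentiment:"
)
# the generated continuation, e.g. " positive", is read off as the prediction
```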
It helps that you can pretty easily frame a bidirectional task in a causal way. For example, fill-in-the-middle tasks.
You can have a bidirectional model directly fill in the middle...
Or you could just frame that as a causal task: give the decoder LLM an instruction to fill in the blanks, plus the entire document with the sections to fill replaced by a special token/identifier, all as input, and train the model to output the missing sections along with their identifiers.
There we go, now we have a causal decoder transformer that can perform a traditionally bidirectional task.
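For concreteness, here's a rough sketch of how such a training example could be serialized for a causal decoder. The sentinel strings (<extra_id_0>, <extra_id_1>, <eos>) and the instruction wording are illustrative assumptions, not any particular model's format:

```python
def make_fim_example(document: str, spans_to_mask: list[tuple[int, int]]):
    """Replace each (start, end) span with a sentinel; the target reproduces the
    removed spans, each prefixed by its sentinel, so the decoder learns to fill
    the blanks left to right."""
    masked, target, last = [], [], 0
    for i, (start, end) in enumerate(sorted(spans_to_mask)):
        sentinel = f"<extra_id_{i}>"
        masked.append(document[last:start] + sentinel)
        target.append(sentinel + document[start:end])
        last = end
    masked.append(document[last:])
    prompt = "Fill in the blanks:\n" + "".join(masked)
    return prompt, "".join(target) + "<eos>"

doc = "The quick brown fox jumps over the lazy dog."
prompt, target = make_fim_example(doc, [(4, 19), (35, 39)])
# prompt: "Fill in the blanks:\nThe <extra_id_0> jumps over the <extra_id_1> dog."
# target: "<extra_id_0>quick brown fox<extra_id_1>lazy<eos>"
```

At inference time you hand the model only the prompt and read the filled spans back out of the generation, so the "bidirectional" context is all visible to the left of the tokens being predicted.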
There are also articles on pre-training BERT models with hardware resources a small lab could afford. Those are still useful, even if not highly competitive, so they could still have value for low-cost, small-model development.
This reminds me of a parallel from classic expert systems: human experts shine at discrimination, and that is one of the most efficient methods of eliciting knowledge from them.