Electra: Pre-Training Text Encoders as Discriminators Rather Than Generators (2020) (arxiv.org)
65 points by luu 9 months ago | 9 comments



LOL, I was reading the abstract and remembering that there used to be a paper like this. Then I looked at the title and saw it was from 2020. For a moment I thought someone had plagiarised the original paper.

Unfortunately, BERT models are dead. Even the cross between BERT and GPT, the T5 architecture (encoder-decoder), is rarely used.

The issue with BERT is that you need to modify the network to adapt it to any task by creating a prediction head, while decoder models (GPT-style) do every task with tokens and never need to modify the network. Their advantage is that they have a single format for everything. BERT's advantage is bidirectional attention, but apparently large decoder models don't have an issue with unidirectionality.
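A rough sketch of the difference, assuming the Hugging Face transformers API (the model names and the 3-class head are just illustrative):

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Encoder route: bolt a task-specific prediction head onto the [CLS] vector.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    cls_head = torch.nn.Linear(encoder.config.hidden_size, 3)  # new module per task

    inputs = tok("great movie", return_tensors="pt")
    cls_vec = encoder(**inputs).last_hidden_state[:, 0]
    logits = cls_head(cls_vec)  # this head has to be trained for each new task

    # Decoder route: the same task is just next-token prediction on a prompt,
    # so the network itself never changes.
    prompt = "Review: great movie\nSentiment (positive/negative/neutral):"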


BERT and T5 models are slowly consuming the computational biology field, so they certainly aren't dead to everyone.


It helps that you can pretty easily frame a bidirectional task in a unidirectional way. For example, fill-in-the-middle tasks.

You can have a bidirectional model directly fill in the middle...

Or you could just frame that as a causal task: give the decoder LLM a command to fill in the blanks, plus the entire document with the sections to fill replaced by special tokens/identifiers as input, and train the model to output the missing sections along with their identifiers.

There we go, now we have a causal decoder transformer that can perform a traditionally bidirectional task.
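A rough sketch of that framing (the sentinel tokens and prompt wording are made up, not any particular model's format):

    # Causal framing of a fill-in-the-middle task: blanks become sentinel
    # tokens in the input, and the decoder learns to emit each missing span
    # after its identifier.
    doc = "ELECTRA trains the model as a <BLANK_0> rather than a <BLANK_1>."
    prompt = "Fill in the blanks:\n" + doc + "\n"
    target = "<BLANK_0> discriminator <BLANK_1> generator <END>"
    # Training pair for a causal LM: predict `target` given `prompt`.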


BERT is alive and well for most commercial uses of NLP.

If you're running 100k QPS through the model with a budget of 0.1 cents per query, you aren't going to be using a GPT model for classification.


BERT isn't dead for smaller tasks (think NER, Sentiment Analysis) where low latency is needed.
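For instance, with small BERT-family checkpoints this is a few lines (a sketch assuming the Hugging Face pipeline API; the model names are just commonly used checkpoints):

    from transformers import pipeline

    # Small encoder models: cheap and low-latency, no prompting needed.
    sentiment = pipeline("sentiment-analysis",
                         model="distilbert-base-uncased-finetuned-sst-2-english")
    ner = pipeline("ner", model="dslim/bert-base-NER",
                   aggregation_strategy="simple")

    print(sentiment("The ELECTRA paper holds up well."))
    print(ner("ELECTRA was published by Google Research in 2020."))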


There are also articles on pre-training BERT models with hardware resources a small lab could afford. Those are still useful, too, even if not highly competitive. So they could still have value for low-cost, small-model development.


Good work by well-known reputable authors.

The gains in training efficiency and compute cost versus widely used text-encoding models like RoBERTa and XLNet are significant.

Thank you for sharing this on HN!


(2020)


Reminds me somewhat of a parallel from classic expert systems: human experts shine at discrimination, and that is one of the most efficient methods of eliciting knowledge from them.
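For reference, ELECTRA's discriminative objective comes down to a per-token "original or replaced?" judgment on text corrupted by a small generator. A simplified sketch of that loss (not the authors' code; shapes and names are illustrative):

    import torch

    def replaced_token_detection_loss(disc_logits, original_ids, corrupted_ids):
        # disc_logits: (batch, seq_len) scores from the discriminator
        # labels: 1 where the small generator swapped the token, else 0
        labels = (corrupted_ids != original_ids).float()
        return torch.nn.functional.binary_cross_entropy_with_logits(
            disc_logits, labels)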



