Efficient Transformer Knowledge Distillation: A Performance Review (arxiv.org)
63 points by PaulHoule on Dec 7, 2023 | 5 comments



This paper combines knowledge distillation and efficient attention mechanisms.

=> It works: the distilled models stay efficient and retain most of their performance at a much lower cost.

Not an unexpected result, but to their credit, they established a new benchmark to test these combinations. KD+Longformer is one of the best, retaining 95.9% of the performance for 50.7% of the cost.
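For anyone wanting the concrete mechanics: the distillation objective is typically the Hinton-style mix of a temperature-scaled KL term against the teacher's soft labels plus ordinary cross-entropy against the gold labels. A rough PyTorch sketch of that loss (hyperparameters are illustrative and this isn't necessarily the paper's exact setup):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        # Soft targets: KL divergence between temperature-scaled distributions,
        # scaled by T^2 to keep gradient magnitudes comparable.
        soft = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
        # Hard targets: ordinary cross-entropy against the gold labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

In the standard recipe the efficient attention (e.g. Longformer) only changes the student architecture; the distillation loss itself stays the same.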


Appendix A

A.1 Data Collection

Data for GONERD was obtained through Giant Oak’s GONER software, which scraped web articles from public-facing online news sources as well as the U.S. Department of Justice’s justice.gov domain. This webtext data was randomly sampled with an upweighted probability toward documents from justice.gov so that justice.gov consisted of roughly 25% of the total GONERD dataset.
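So documents weren't drawn uniformly: justice.gov pages were sampled with boosted probability until they made up about a quarter of the corpus. One rough way to get that share (a stratified sketch with a hypothetical (url, text) document structure; the paper doesn't spell out the exact procedure):

    import random

    def sample_corpus(docs, target_size, doj_share=0.25):
        # docs: hypothetical list of (url, text) pairs from the GONER scrape.
        doj = [d for d in docs if "justice.gov" in d[0]]
        rest = [d for d in docs if "justice.gov" not in d[0]]
        n_doj = min(int(target_size * doj_share), len(doj))
        sampled = random.sample(doj, n_doj)
        sampled += random.sample(rest, min(target_size - n_doj, len(rest)))
        random.shuffle(sampled)
        return sampled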


I skimmed the paper but I don't really understand what knowledge distillation actually entails.


Don’t mean to be flip at all, but may I suggest:

1. Use GPT-4 with something like this (see the code sketch after step 3):

“Help me understand what this paper is about and estimate whether the relevance and impact to the field is likely to be low, medium, or high.

Explain jargon that may be specific to AI research, but don’t bother explaining or expanding on terms familiar to a working software developer or covered in basic undergraduate computer science.”

2. Follow the above prompt with either the abstract or the full text of the paper.

3. Post something useful here and save others the time.
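For concreteness, steps 1 and 2 are only a few lines with the OpenAI Python client. A rough sketch (the v1 client, the exact model name, and the placeholder for the abstract are my assumptions):

    from openai import OpenAI  # assumes the v1 openai Python client

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT = (
        "Help me understand what this paper is about and estimate whether the "
        "relevance and impact to the field is likely to be low, medium, or high.\n\n"
        "Explain jargon that may be specific to AI research, but don't bother "
        "explaining or expanding on terms familiar to a working software developer "
        "or covered in basic undergraduate computer science."
    )

    abstract = "<paste the abstract or full text of the paper here>"

    resp = client.chat.completions.create(
        model="gpt-4",  # or whichever model you have access to
        messages=[{"role": "user", "content": PROMPT + "\n\n" + abstract}],
    )
    print(resp.choices[0].message.content)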

Do not, in my opinion, copy/paste LLM output as a comment. But I would love to hear your own succinct, human-sounding, HN-guideline-compatible thoughts.




