
As far as I can tell, though, the core idea is the same, to focus on the differences, but the implementation is different. Differential Transformers 'calculate attention scores as the difference between two separate softmax attention maps', so they must still process the redundant areas. This approach removes them altogether, which would significantly reduce compute. Very neat idea.
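For anyone curious, here's a rough numpy sketch of what that quoted formulation amounts to, as I understand it. The weight names, the single head, and the fixed lambda are my own simplification, not the paper's actual code:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
        # Single-head sketch: the attention score map is the difference of
        # two independently-parameterized softmax maps, which cancels the
        # common-mode (redundant) attention weight shared by both.
        # x: (seq_len, d_model); Wq*/Wk*/Wv: (d_model, d_head)
        # lam: weighting of the second map (a constant here; the actual
        # paper handles it more carefully).
        d_head = Wq1.shape[1]
        q1, k1 = x @ Wq1, x @ Wk1
        q2, k2 = x @ Wq2, x @ Wk2
        v = x @ Wv
        a1 = softmax(q1 @ k1.T / np.sqrt(d_head))
        a2 = softmax(q2 @ k2.T / np.sqrt(d_head))
        return (a1 - lam * a2) @ v

    # toy usage
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 16))
    W = lambda: rng.standard_normal((16, 16)) * 0.1
    out = differential_attention(x, W(), W(), W(), W(), W())
    print(out.shape)  # (8, 16)

The point is that both softmax maps still attend over the full sequence; the subtraction only suppresses the redundancy after the fact, rather than skipping it.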

However, I do think that background information can sometimes be important. I reckon a mild improvement on this model would be to leave the background in the first frame, and perhaps in every x-th frame after that, so that the model gets better context cues. This would also more closely mirror how video compression works.
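In compression terms that suggestion is basically periodic keyframes (I-frames) with deltas in between. A toy sketch of the scheme, just to illustrate the idea (the interval and names are mine, not the model's actual input pipeline):

    import numpy as np

    def encode_frames(frames, keyframe_interval=8):
        # Keep the full frame every `keyframe_interval` frames (like
        # I-frames in video compression) and only the pixel differences
        # in between.
        encoded = []
        for i, frame in enumerate(frames):
            if i % keyframe_interval == 0:
                encoded.append(("key", frame))                    # full background kept
            else:
                encoded.append(("diff", frame - frames[i - 1]))   # changes only
        return encoded

    def decode_frames(encoded):
        frames = []
        for kind, data in encoded:
            frames.append(data if kind == "key" else frames[-1] + data)
        return frames

    # round-trip check on a random "video"
    video = [np.random.rand(4, 4) for _ in range(20)]
    decoded = decode_frames(encode_frames(video))
    assert all(np.allclose(a, b) for a, b in zip(video, decoded))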




Actually, I was misled by the video example. They do keep the background information: they use a temporal encoding so that it is propagated through. Very interesting and well thought out.



