Mixture-of-Depths: Dynamically allocating compute in transformer language models (arxiv.org)
5 points by GaggiX 74 days ago | 2 comments



> MoD transformers demonstrate the value of routing among different types of computations. In this work the types were either the conventional transformer block, or a null computation (functionally equivalent to multiplying by zero). However, one can imagine extending this idea further by routing between even more types of computation. For example, perhaps some tokens are routed to "memory lookup" functions, and others are routed to "tool use" functions.

This is one of the most interesting parts of the paper for me. It seems that this could be a much more efficient approach for tool use and math calculation.
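To make the routing idea concrete, here is a minimal sketch of what per-token MoD-style routing could look like in PyTorch. This is not the paper's code; the module and parameter names (MixtureOfDepthsLayer, capacity, the sigmoid gating of the residual update) are illustrative assumptions. The core idea matches the quote: a router scores each token, the top-k tokens get the full transformer block, and the rest take the "null" computation, i.e. a zero contribution to the residual stream.

    import torch
    import torch.nn as nn


    class MixtureOfDepthsLayer(nn.Module):
        """Sketch of MoD-style token routing (hypothetical names, not the paper's code).

        A linear router scores every token; only the top-k tokens per sequence are
        run through the transformer block. The remaining tokens take the "null"
        path: their block contribution is zero, so they pass through the residual
        stream unchanged.
        """

        def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.125):
            super().__init__()
            self.router = nn.Linear(d_model, 1)  # one scalar score per token
            self.block = block                   # a standard transformer block
            self.capacity = capacity             # fraction of tokens given full compute

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, d_model)
            batch, seq_len, d_model = x.shape
            k = max(1, int(self.capacity * seq_len))

            scores = self.router(x).squeeze(-1)                        # (batch, seq_len)
            topk_idx = scores.topk(k, dim=-1).indices.sort(-1).values  # keep token order
            gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, d_model)

            selected = torch.gather(x, 1, gather_idx)   # (batch, k, d_model)
            processed = self.block(selected)            # full compute on k tokens only

            # Weight the block's residual update by the router score so the
            # router receives gradients through the tokens it selects.
            gate = torch.gather(scores, 1, topk_idx).sigmoid().unsqueeze(-1)
            update = (processed - selected) * gate

            # Unselected tokens get a zero update ("multiplying by zero").
            return x.scatter_add(1, gather_idx, update)

Extending the router from one output score to several (block, null, memory lookup, tool call) is then just a wider routing head plus extra branches, which is presumably what the authors are gesturing at in the quoted passage.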


Pretty surprising that Google DeepMind is still publishing papers like this, unless they already have something much further ahead. I expect OpenAI and Anthropic have their own equivalents of this research, but publishing it still lets others catch up more easily and brings LLMs closer to being commodities.



