I just ran phi3:mini[1] with Ollama on an Apple M3 Max laptop, on battery in "Low power" mode (worth mentioning because that makes some things run more slowly). phi3:mini output roughly 15-25 words/second. The token rate is higher, but I don't have an easy way to measure that.
Then llama3:8b[2]. It output 28 words/second. This is higher despite the larger model, perhaps because llama3 obeyed my request to use short words.
Then mixtral:8x7b[3]. That output 10.5 words/second. It looked like 2 tokens/word, as the pattern was quite repetitive and visible, but again I have no easy way to measure it.
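For measuring the token rate rather than eyeballing words, Ollama does expose the numbers: `ollama run <model> --verbose` prints an eval rate, and the non-streaming API response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds). Here is a minimal sketch of the arithmetic, using made-up sample numbers rather than a live server:

```python
def tokens_per_second(response: dict) -> float:
    """Token rate from an Ollama /api/generate or /api/chat response.

    eval_count is the number of tokens generated; eval_duration is
    the generation time in nanoseconds (both documented Ollama fields).
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Hypothetical example: 512 tokens generated in 20 seconds.
sample = {"eval_count": 512, "eval_duration": 20_000_000_000}
print(tokens_per_second(sample))  # 25.6 tokens/second
```

With a real response from `POST /api/generate` (with `"stream": false`), the same two fields would give the exact tokens/second figure.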
Still on battery in "Low power" mode, I was impressed that even with mixtral:8x7b the fans didn't come on at all during the first 2 minutes of continuous output. Total system power usage peaked at 44W, of which about 38W was attributable to the GPU.
[1] https://ollama.com/library/phi3
[2] https://ollama.com/library/llama3
[3] https://ollama.com/library/mixtral