How do you measure tokens/sec? Here's my attempt on a new M4 Max 128GB; it does about 6 words/sec:
bash> time ollama run llama3.3 "What's the purpose of an LLM?" | tee ~/Downloads/what\ is\ an\ LLM.txt
A Large Language Model (LLM) is a type of artificial intelligence (AI) designed to process and understand human language. The primary purposes of an LLM are:
(... contents excerpted for brevity)
Overall, the purpose of an LLM is to augment human capabilities by providing a powerful tool for understanding, generating, and interacting with human language.
real 0m59.040s
user 0m0.071s
sys 0m0.081s
pmarreck 59s35ms
20241206220629 ~ bash> wc -w Downloads/what\ is\ an\ LLM.txt
359 Downloads/what is an LLM.txt
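For what it's worth, the words/sec figure is just the word count divided by the wall-clock time; a quick sketch using the numbers from the run above (359 words, 59.04 s real time):

```shell
# Rough throughput: word count / wall-clock seconds, from the run above.
words=359
seconds=59.04
awk -v w="$words" -v s="$seconds" 'BEGIN { printf "%.2f words/sec\n", w/s }'
# → 6.08 words/sec
```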
LM Studio puts stats at the bottom of each reply, like: 2.09 tok/sec, 346 tokens, 1.74s to first token. That was for a 259-word response, so roughly 0.75 words/token. If that ratio holds, you might be getting ~8 tok/sec on your M4 Max?
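Back-of-envelope version of that conversion, assuming the ~0.75 words/token ratio carries over between models (a rough assumption; tokenizers differ):

```shell
# tok/sec = words/sec ÷ words-per-token
# 6 words/sec from the earlier run; 0.75 words/token from LM Studio's stats.
awk 'BEGIN { printf "%.1f tok/sec\n", 6 / 0.75 }'
# → 8.0 tok/sec
```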
Looks like LM Studio is available for ARM-based Macs, so if you want to give it a try, that'd be one way to get these stats. LM Studio also surfaces some parameters to play around with, and keeps a record of past conversations, if that appeals to you.
Just add "--verbose" to your run command, e.g. "ollama run mistral-nemo:latest --verbose", and it'll print token counts and timing info (including tokens/sec) after each response.