Hacker News

How do you measure tokens/sec? Here's my attempt on a new M4 Max 128GB, does about 6 words/sec:

    bash> time ollama run llama3.3 "What's the purpose of an LLM?" | tee ~/Downloads/what\ is\ an\ LLM.txt
    A Large Language Model (LLM) is a type of artificial intelligence (AI) designed to process and understand human language. The primary purposes of an LLM are:
(... contents excerpted for brevity)

    Overall, the purpose of an LLM is to augment human capabilities by providing a powerful tool for understanding, generating, and interacting with human language.

    real  0m59.040s
    user  0m0.071s
    sys 0m0.081s

    pmarreck  59s35ms
    20241206220629 ~ bash> wc -w Downloads/what\ is\ an\ LLM.txt
        359 Downloads/what is an LLM.txt


LM Studio puts stats at the bottom of each reply, like: 2.09 tok/sec, 346 tokens, 1.74s to first token. That was for a 259-word response, so ~0.75 words/token. If that ratio holds, you might be getting ~8 tok/sec on your M4 Max.

Looks like LM Studio is available for ARM-based Macs, so if you want to give that a try, it'd be one way to get these stats. LM Studio also surfaces some parameters to play around with, and keeps a record of past conversations if that appeals to you.


Yeah, I already use it along with Ollama, I just didn't notice those stats I guess!


Just add "--verbose" to your run command, e.g. "ollama run mistral-nemo:latest --verbose", it'll dump the token counts and timing info after each message.



