llama.cpp

Having been impressed with cloud LLMs of late, I started to play with local models and tried out llama.cpp on my Mac, an M4 with 16GB. This was right around the time Gemma4 was released. I tested the E4B model (8B parameters) at Q4_K_M quantization.
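
For a rough sense of what fits in 16GB, the size of a model's weights can be estimated from the parameter count and the bits stored per weight. The sketch below is only a back-of-the-envelope estimate: the 4.85 bits-per-weight figure for Q4_K_M is an assumed average, `weight_memory_gib` is a hypothetical helper, and the result ignores the KV cache, runtime buffers, and everything else running on the machine.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


N_PARAMS = 8e9          # 8B parameters, as in the model tested above
Q4_K_M_BITS = 4.85      # assumption: rough average bits per weight for Q4_K_M
FP16_BITS = 16.0        # unquantized half-precision weights, for comparison

quantized = weight_memory_gib(N_PARAMS, Q4_K_M_BITS)
half_precision = weight_memory_gib(N_PARAMS, FP16_BITS)

print(f"Q4_K_M weights: {quantized:.1f} GiB")
print(f"FP16 weights:   {half_precision:.1f} GiB")
```

Under these assumptions the 4-bit weights come to roughly 4.5 GiB, while the same model at 16 bits per weight would need nearly 15 GiB, which is why quantization matters so much on a 16GB machine.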

Initial thoughts.

My Mac has nowhere near enough RAM to run a “useful” tool-calling model. Tool calling is the whole reason I was impressed with cloud LLMs, so I gotta say… on an under-spec’d system, it is pretty terrible.
Gemma4 kept spinning its wheels when it came to tool calling. You can tell because local models let you see the “thinking” as it happens, and the slow token generation makes it easy to read along, unlike cloud LLMs that speed through it.
I was using the latest Gemma4 template (earlier ones had tool-calling bugs) and the Google-recommended configuration parameters, but the only reliable tool call was web search, and just barely.
Part of this is probably that the biggest model I can run on a 16GB host is an 8B-parameter model, but Google advertised Gemma4 as being capable of tool calling even at the smaller sizes.
On my M4, it was generating about 27 tokens/s. It felt slow. I hear that even the latest and greatest GPUs give maybe twice that, which I would expect still feels behind cloud model speed.
This helps me understand why there is such a push towards compute and memory in tech today. They are the currency for faster LLM results.
As a chatbot, Gemma4 is very good, and 27 tokens/s is reasonable if you ask a question, walk away, and come back a minute later. So if privacy and security are important, running a local LLM as a chatbot seems just fine.
I asked a lot of questions that I would usually ping Gemini for, and it performed well. I was surprised by how often I hit Gemma4’s safety guardrails, though; it came up a lot more than expected. For example, I asked about the safety of chemicals and it instantly told me it couldn’t advise.
I then loaded up Qwen3.5, which had no concerns about answering the same question.
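
To put the 27 tokens/s figure above in perspective, the wait for a response is just the number of generated tokens divided by the generation rate. The sketch below is a simple illustration: the response lengths are made-up examples rather than measurements, `generation_seconds` is a hypothetical helper, and the 54 tokens/s row only reflects the hearsay "maybe twice that" estimate. Thinking tokens count toward the total as well, so a long reasoning trace adds to the wait.

```python
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady rate, ignoring prompt processing."""
    return n_tokens / tokens_per_second


RATES = {
    "M4, as observed": 27.0,
    "hypothetical 2x faster GPU": 54.0,  # assumption: double the observed rate
}

# Hypothetical response sizes (thinking + answer), in tokens.
RESPONSE_SIZES = (200, 1000, 4000)

for label, rate in RATES.items():
    for n_tokens in RESPONSE_SIZES:
        seconds = generation_seconds(n_tokens, rate)
        print(f"{label}: {n_tokens} tokens in {seconds:.0f} s")
```

At 27 tokens/s, a minute of waiting buys about 1,620 tokens, thinking included.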
