{"id":11,"date":"2026-05-05T14:32:34","date_gmt":"2026-05-05T14:32:34","guid":{"rendered":"https:\/\/oliverng.com\/ai\/?p=11"},"modified":"2026-05-07T13:51:20","modified_gmt":"2026-05-07T13:51:20","slug":"llama-cpp","status":"publish","type":"post","link":"https:\/\/oliverng.com\/ai\/2026\/05\/05\/llama-cpp\/","title":{"rendered":"llama.cpp"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Being impressed with Cloud LLMs of late, I started to play with local models and assessed llama.cpp on my Mac &#8211; an M4 with 16GB. This was right when Gemma4 had just been released. I tested the E4B model (8B parameters) at Q4_K_M quantization.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Initial thoughts.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>My Mac has nowhere near enough RAM to run a &#8220;useful&#8221; tool-calling model.  Tool calling being the whole reason I was impressed with Cloud LLMs, I gotta say&#8230; when you have an underspec&#8217;d system, it is pretty terrible.<\/li>\n\n\n\n<li>Gemma4 kept spinning its wheels when it came to tool calling.  You know this because with local models, you can see the &#8220;thinking&#8221; happening.  And it is way more obvious because the slow token generation makes the thinking part very easy to read, unlike Cloud LLMs that speed through it.<\/li>\n\n\n\n<li>I was using the latest Gemma4 template (earlier ones had tool calling bugs) and the Google-recommended configuration parameters, but the only reliable tool call was web search, and just barely.<\/li>\n\n\n\n<li>It is probably also due to my top model option for a 16GB host being an 8B parameter model, but Google advertised Gemma4 as being capable of this even at the lower ranges.<\/li>\n\n\n\n<li>On my M4, it was generating about 27 tokens\/s.  It felt bad.  I hear that even the latest and greatest GPUs give maybe twice that, 
which I would expect still feels behind cloud model speed.<\/li>\n\n\n\n<li>This helps me understand why there is such a push towards compute and memory in tech today.  It is the currency for faster LLM results.<\/li>\n\n\n\n<li>For the purposes of a chat bot, Gemma4 is very good and 27 tok\/s was reasonable if you were to ask a question, walk away, and come back a minute later.  So if privacy and security are important, running a local LLM as a chatbot seems just fine.<\/li>\n\n\n\n<li>I asked a lot of questions that I usually would ping Gemini for and it performed well.  I was surprised, though, that I constantly hit Gemma4&#8217;s safety guardrail.  It came up a lot more than expected.  For example, I asked about the safety of chemicals and it instantly told me it couldn&#8217;t advise.<\/li>\n\n\n\n<li>I then loaded up Qwen3.5, which had no concerns about safety with the same question.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Being impressed with Cloud LLMs of late, I started to play with local models and assessed llama.cpp on my Mac &#8211; an M4 with 16GB. This was right when Gemma4 had just been released. I tested the E4B model (8B parameters) at Q4_K_M quantization. 
Initial thoughts.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-11","post","type-post","status-publish","format-standard","hentry","category-experimenting"],"_links":{"self":[{"href":"https:\/\/oliverng.com\/ai\/wp-json\/wp\/v2\/posts\/11","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/oliverng.com\/ai\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/oliverng.com\/ai\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/oliverng.com\/ai\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/oliverng.com\/ai\/wp-json\/wp\/v2\/comments?post=11"}],"version-history":[{"count":4,"href":"https:\/\/oliverng.com\/ai\/wp-json\/wp\/v2\/posts\/11\/revisions"}],"predecessor-version":[{"id":16,"href":"https:\/\/oliverng.com\/ai\/wp-json\/wp\/v2\/posts\/11\/revisions\/16"}],"wp:attachment":[{"href":"https:\/\/oliverng.com\/ai\/wp-json\/wp\/v2\/media?parent=11"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/oliverng.com\/ai\/wp-json\/wp\/v2\/categories?post=11"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/oliverng.com\/ai\/wp-json\/wp\/v2\/tags?post=11"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}