Category: Experimenting

  • Antigravity 2.0

    Google just dropped Antigravity 2.0 and ripped out the VSCode fork. I had been using AG since its early preview because it had extremely generous free Gemini caps within the IDE, all with a VSCode interface. With 2.0, it seems, they just decided to be Claude Code with a Gemini wrapper.

    So 2.0 = Claude Code. It has scheduled tasks, connectors, MCP, projects and a primarily chat-based interface. I guess Google decided Anthropic won and, with Codex doing the same, they couldn’t fall behind. They put a CLI behind it and added projects, worktrees, everything in Claude Desktop. And they moved the IDE into another product.

    It was inevitable because I think Anthropic proved that this is the right interface for most people. VSCode was never going to be the UX that my wife uses. Claude Desktop, while having its own confusing UX issues, is an understandable means to connect and interact with an LLM. So here we are. Microsoft is probably next with Copilot. It’s about time. Copilot is a hot mess, so much so that even Microsoft employees do not want to dogfood it.

  • Non-determinism

    There is a weird shift in how you go from a world where coding is a very structured and deterministic thing, to vibe coding. As a coder you might write a script. Every time you run that script the same thing happens. every. time.

    When working with LLMs it is easy to forget that this thing is not what you think it is. It has structure like code, it has guardrails like code, it can even write structured code with guardrails! But run it enough times and you realize that it is not always doing the exact same thing.

    And I find as a coder, it can be hard to change that way of thinking. I want to think of an LLM like a new abstraction layer over code that works the exact same way.

    I have a family dashboard I created with Claude. It has three parts: Schedule, Homework and Meals. Every week on Sunday, it pulls each of our schedules and extracurriculars, extracts my kids’ calendar from their teachers’ handouts, overlays their important events for the week and generates a week view. I also work with my kids on their homework. I generate homework exercises as per the school curriculum and commit progress to memory. I ask Claude to write new homework into a set of daily one-pagers I can easily print out for the kids. Lastly, I get Claude to randomly select favourite recipes from my note repo, extract all the ingredients, and put them back into my note repo so I have a grocery list every Monday. It works great. But it didn’t always.

    At first, after a bunch of prompt testing, I had a prompt that would run and generate what I thought was perfect. And next week, something would look off. It could be that the schedule was in rows rather than columns. It could be that their homework would have 5 questions instead of 15. It could be that the recipe cards I ask it to create no longer fit on one line.

    My point in all of this is that, as a coder, I was treating LLMs like code, but they’re kind of like people. Being non-deterministic, they may do something slightly differently every time they do it. That is, unless you force some structure.

    And so I learned: if you need something the same every single time (and for output, that is often what we want), you have to use a combination of prompting and coding knowledge to force the LLM into a structure. This could be by forcing it to run a script (which is deterministic) or by having a template with built-in sentinels. Frankly these quirks feel odd, but considering that we are coding using natural language alone, the trade-off seems fair.
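
    Here is a minimal sketch of the template-with-sentinels idea, not my exact setup. The template text, the `<<NAME:START>>` / `<<NAME:END>>` marker format and the function names are all made up for illustration; the only number taken from above is the 15 homework questions. The layout lives in code, the model only supplies the text between the markers, and a deterministic check rejects output that drifts.

```python
import re

# Hypothetical fixed layout: only the text between sentinel pairs changes.
TEMPLATE = """\
WEEK VIEW
<<SCHEDULE:START>>
<<SCHEDULE:END>>
HOMEWORK
<<HOMEWORK:START>>
<<HOMEWORK:END>>
"""

EXPECTED_HOMEWORK_QUESTIONS = 15  # the count mentioned in the post


def fill_section(template: str, name: str, body: str) -> str:
    """Replace whatever sits between the START/END sentinels for `name`."""
    safe = re.escape(name)
    pattern = re.compile(
        rf"(<<{safe}:START>>\n).*?(<<{safe}:END>>)", re.DOTALL
    )
    # A function replacement avoids backslash surprises in model output.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + body.strip() + "\n" + m.group(2), template
    )
    if count != 1:
        raise ValueError(f"expected exactly one {name} section, found {count}")
    return new_text


def homework_is_valid(body: str) -> bool:
    """True only if the body has exactly the expected number of numbered lines."""
    questions = re.findall(r"^\d+\.\s+\S", body, flags=re.MULTILINE)
    return len(questions) == EXPECTED_HOMEWORK_QUESTIONS


if __name__ == "__main__":
    # Stand-in for model output; in practice this comes back from the LLM.
    llm_output = "\n".join(f"{i}. Question {i}" for i in range(1, 16))
    if homework_is_valid(llm_output):
        print(fill_section(TEMPLATE, "HOMEWORK", llm_output))
    else:
        print("Homework drifted from the expected shape; ask the model again.")
```

    If the check fails, the page is never generated from the bad output, so a week with 5 questions instead of 15 gets caught before it reaches the printer.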

  • thrift and tokens (printing press)

    Everyone knows token economy is 💰

    I came across this new abstraction layer for integrating external tooling into LLMs, called printingpress.dev.

    From an API spec, from a website with no public API, from a beloved community fan project – one command prints a token-efficient Go CLI, a Claude Code skill, an OpenClaw skill, and an MCP server. Peter Steinberger showed the way with discrawl and gogcli: a local SQLite mirror beats a remote API call, compound commands beat ten round trips, and an agent-native CLI beats raw HTTP. The press bakes that playbook into every binary it prints. Muscle memory for agents.

    It uses a custom compiled CLI, which saves the token exchange overhead commonly seen with MCP, connectors, etc. MCP traffic is heavy. Part of why my exploration into local LLMs stopped was that I realized how much context an MCP exchange takes.

    Similar to how the exa and tavily MCPs clean up garbage from the web to give the LLM clean search results, printingpress goes a step further and forgoes the whole MCP exchange for a CLI that runs locally and does all the dirty work more efficiently, saving your tokens.
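
    To get a feel for why a CLI can be lighter on context, here is a rough sketch. Everything in it is illustrative: the `search_notes` tool, its schema and the `notes search` usage line are hypothetical, and the 4-characters-per-token rule is a crude estimate, not a real tokenizer. It only compares the size of one JSON tool definition of the kind an MCP server advertises against a one-line CLI usage string.

```python
import json


def rough_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


# Hypothetical tool definition in the name/description/inputSchema shape
# that MCP servers use to describe a tool to the model.
mcp_tool = {
    "name": "search_notes",
    "description": "Search the note repository and return matching notes.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Full-text query."},
            "limit": {
                "type": "integer",
                "description": "Maximum number of results.",
                "default": 10,
            },
        },
        "required": ["query"],
    },
}

# Hypothetical one-line usage string for an equivalent local CLI.
cli_usage = "notes search <query> [--limit N]"

schema_cost = rough_tokens(json.dumps(mcp_tool))
cli_cost = rough_tokens(cli_usage)

if __name__ == "__main__":
    print(f"tool schema: ~{schema_cost} tokens")
    print(f"CLI usage:   ~{cli_cost} tokens")
```

    This is one tool. A server that exposes dozens of them pays that cost for each definition, before any results come back.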

    The beauty of it is there is also a prompt kit that helps generate brand new CLIs from any service. So point it at a service and watch it go.

  • llama.cpp

    Having been impressed with Cloud LLMs of late, I started to play with local models and assessed llama.cpp on my Mac, an M4 with 16GB. This was around the time Gemma4 had just been released. I tested the E4B model (8B parameters) at Q4_K_M quantization.

    Initial thoughts.

    • My Mac has nowhere near enough RAM to run a “useful” tool calling model. Tool calling being the whole reason I was impressed with Cloud LLMs, I gotta say… when you have an under spec’d system, it is pretty terrible.
    • Gemma4 kept spinning its wheels when it came to tool calling. You know this because with local models, you can see the “thinking” happening. And this is way more obvious because the slow token generation makes it very easy to read the thinking part, unlike Cloud LLMs that speed through it.
    • I was using the latest Gemma4 template (earlier ones had tool calling bugs) and the Google-recommended configuration parameters, but the only reliable tool call was web search, and just barely.
    • It is probably also due to my top model option for a 16GB host being an 8B parameter model, but Google advertised Gemma4 as being capable of this even at lower ranges.
    • On my M4, it was generating about 27 tokens/s. It felt bad. I hear that even the latest and greatest GPUs give maybe twice that, which I would expect still feels behind cloud model speed.
    • This helps me understand why there is such a push towards compute and memory in tech today. It is the currency for faster LLM results.
    • For the purposes of a chat bot, Gemma4 is very good, and 27 tok/s was reasonable if you were to ask a question, walk away, and come back a minute later. So if privacy and security are important, running a local LLM as a chatbot seems just fine.
    • I asked a lot of questions that I usually would ping Gemini for and it performed well. I was surprised that I constantly hit Gemma4’s safety guardrail. It came up a lot more than expected. For example, I asked about the safety of chemicals and it instantly told me it couldn’t advise.
    • I then loaded up Qwen3.5, which had no safety concerns about the same question.
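
    For a sense of what 27 tokens/s means in practice, here is a quick back-of-the-envelope sketch. The 27 tokens/s figure and the "maybe twice that" guess come from the notes above; the 500-token reply length is purely an assumption for illustration, and thinking tokens count towards it.

```python
def tokens_generated(tokens_per_second: float, seconds: float) -> float:
    """How many tokens a model emits in a given wall-clock time."""
    return tokens_per_second * seconds


def seconds_for_reply(reply_tokens: int, tokens_per_second: float) -> float:
    """How long a reply of a given length takes to generate."""
    return reply_tokens / tokens_per_second


LOCAL_TPS = 27.0              # measured on the M4, from the notes above
DOUBLED_TPS = 2 * LOCAL_TPS   # the unverified "twice that" guess
REPLY_TOKENS = 500            # assumed reply length, thinking included

if __name__ == "__main__":
    print(f"one minute at 27 tok/s: {tokens_generated(LOCAL_TPS, 60):.0f} tokens")
    print(f"500-token reply at 27 tok/s: {seconds_for_reply(REPLY_TOKENS, LOCAL_TPS):.1f} s")
    print(f"500-token reply at 54 tok/s: {seconds_for_reply(REPLY_TOKENS, DOUBLED_TPS):.1f} s")
```

    So "walk away and come back a minute later" buys roughly 1,600 tokens, which is fine for a chat answer but gets eaten quickly when the model spends most of it thinking about a tool call.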