Category: Experimenting

Cowork Mobile
Cowork can now be steered and accessed on mobile. Previously, Cowork ran in a local OS container with no remote access. That said, Anthropic has done a terrible job explaining how this all works.

Here’s a top ten list of my experience.
1. Any projects that you were using for cowork will not migrate over to cloud if they already access local files on disk. There is no obvious migration tool to make it work via cloud.
  - I found the best way to migrate was to create a new project in cowork, and then add the same local folder.
  - You’ll see this in your cowork projects list. There is a drop down for configuring the project to run in the cloud and it is disabled for existing projects involving folder access.
  - Creating a new project gives Claude the ability to re-establish the permissions it needs to access a local folder and run in the new cloud format.
2. If you are working on a local cowork project, your conversations will not sync over to mobile. It is the most confusing part of this whole roll-out. Some conversations sync, others do not.
3. The easiest way to know if your Cowork session is running in the cloud is the little icon on the top left. Same concept as Claude Code. If you see a cloud, Claude is working from a cloud container. If you see a computer, Claude is local to your desktop.
4. The benefit of working in cowork mobile is that you can leverage your MCP connectors and local folders you’ve approved, and start a new task away from your computer. This is different from Claude Chat because while chat has some connectors, it is not MCP level and cannot work with files.
5. Claude Cowork cloud is very slow. Like, sssssssllllllooooooowww. I had Cowork local tasks running snappy, and when I moved them cloud-native, it got rough. Part of this is that managing files across MCP connectors is brutal. If you are used to local CLI-based connectors like google workspace-CLI, where Claude Cowork could call the CLI directly, you can’t do that in a cloud container. You are reliant on the cloud container having access to the MCP connectors you’ve pre-configured. And that means MCP inefficiencies.
6. Part of the benefit of running cowork was direct file access and manipulation. File access is easy because Cowork would just pull the file into the mount path of its local container. In this remote cloud paradigm, that is no longer as easy. I did a test to move my folder into Google Drive, thinking it could replicate my local folder structure on disk. I immediately ran into problems.
  - It was slow. I know I said this before, but man, it is so bad.
  - MCP connectors are limited when it comes to writes. Google Drive, for example, lets you create files but not update them. You have to re-upload each time.
  - MCP connectors do not expose an easy way to copy a file. The cloud container had to stream the file byte-by-byte over MCP, with the model as the translator. 500K tokens for a 500K file. Brutal.
  - In the end, I reverted back to local only for these projects because running everything through a kneecapped MCP was a terrible experience.
  - I think MCPs are the weakest link here. I wonder if we could get faster tooling on the cloud container, like having the googleworkspace-CLI installed and built into the container to improve speed and capabilities, rather than forcing us to go over MCP.
7. The hybrid cloud/desktop approach seems to be the goal-post here. Steer from mobile but access data from local disk via Claude Desktop. Occasionally start tasks from mobile where the ask is not heavy or time-bound. This requires your desktop to be open but that was always the case with Cowork scheduled tasks.
  - It is incredibly unclear what I can and cannot do at any moment. It is not clear what folders I have access to from my session, how I start a session that gives me folder access, or how to access folder-enabled sessions from my mobile. All things that are being advertised.
8. On multiple occasions, the Cloud container just stopped responding. Like out of 30 tasks, around 5 just stopped where I was wondering where that conversation went, only to revisit it and notice it says “Starting .. 1s, 2s” all over again. This is after being prompted over 10m ago. Chalk it up to being in BETA.
9. It all feels like a worse Claude Code remote-control. I can kind of feel the friction the product owners were likely working through when they were building this out going, well what about this problem scenario, or this one, or that one.
  - For example, if I start a Cowork session from the mobile client, I cannot then add folders or files to the discussion, even once I’m back on my desktop. I have to start a brand new chat from my desktop. Friction.
10. All said, it’s an improvement from where we were before. I can access Cowork from my phone, whereas yesterday I had to go home and unlock my desktop. So that’s a win. And I expect they will iterate from here given that Cowork is a very popular product. Unlike Dispatch? Can’t remember the last time that got some love since it was announced.
July 17, 2026

A day of Claude workflows

I tried to come up with a scenario to try out Claude Dynamic Workflows, a deterministic way to manage processes for subagents. Claude recommends you use them for large-scale activities that benefit from multiple agents. I wanted to understand how it is different from Claude Agent Teams, which, it is easy to forget, also exists.

One noticeable difference is that agent teams share context within the group. Workflow agents do not; each acts independently on its prompt. Workflows are also programmatic. Their sequence of actions is managed through Python code + prompts. This allows you to deterministically define the workflow and its results.

I thought it would be fun to take advantage of this knowledge and use it as a way for AI to reason about security risk.

A common argument in cybersecurity is risk. You have a vulnerability, and someone has to quantify the risk. Not always trivial. Fights ensue. People say that it will never happen. They say the architecture does not allow it. They say controls exist. Everyone comes with their independent views. I took these horrible memories of my career and built an agent workflow that could debate risk, in hopes I would never have to live it again.

Kidding aside, it was a fun experiment. LLMs are great at grounding themselves into a specific viewpoint, taking context, and providing a result. So I made a roundtable of tech personalities that can repeatedly play out my worst nightmares.

The Panel

The roles I chose were a Security Analyst, a Business Owner, and an Architecture Lead. These three represent the typical friction I’ve seen, where each debates from their own area of accountability.

Security Analyst: Argues for stronger controls and higher severity where evidence supports it. Accountable for what happens if the risk is realized. They are kept in check by requiring that they not simply argue all risk is critical risk, but stay defensible and grounded.

Business / Product Owner: Optimizes for time-to-market, user friction, and opportunity cost. Argues for acceptance, deferrals, or compensating controls. Accountable for the cost of not shipping, including security costs of users staying on worse legacy systems.

Architecture: What can be built and run, and what each control costs in complexity, blast radius, and operational burden. Often the tie-breaker — surfaces that a control Security wants may be infeasible as stated but achievable in another form.

I thought it would be interesting to have additional roles that clean up the debate and tie-break.

Chair: Summarize key debate items. Takes no side. Translates the risk debate items from technical language into common human language. This Chair will help summarize the “agent talk” so that I can understand what’s being said and debated.

Skeptic: Acts to be “that guy” that only has negative things to say. Will debate the specific controls, proposals, risks to find counter-points against the role proposing the item. It reinforces a sanity check to re-think on ideas. Protects against LLM “making stuff up and confirming their own behaviour”. Adds rigor to analysis.
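
To make "programmatic" concrete, here is a minimal Python sketch of how a panel like this could be sequenced in code. This is not the Claude Workflows API. `run_agent` is a hypothetical stand-in for whatever call actually launches a subagent, and the role instructions are shortened placeholders. The only point is that the order of steps lives in code, and each panelist gets a prompt built from scratch, so it never sees another panelist's context.

```python
from dataclasses import dataclass

# Shortened placeholder instructions for the three debating roles.
PANEL = {
    "security": "Argue for stronger controls where evidence supports it.",
    "business": "Argue for shipping; weigh opportunity cost.",
    "architecture": "Say what can be built and what each control costs.",
}


@dataclass
class Position:
    role: str
    text: str


def run_agent(role: str, prompt: str) -> str:
    """Hypothetical stand-in for a real subagent call (not a real API)."""
    return f"[{role}] reply to {len(prompt)} chars of prompt"


def run_panel(risk_statement: str):
    positions = []
    # Fixed iteration order: the sequence is decided by code, not by the model.
    for role, instructions in PANEL.items():
        # Each prompt is built independently, so contexts stay separate.
        prompt = f"{instructions}\n\nRisk statement: {risk_statement}"
        positions.append(Position(role, run_agent(role, prompt)))

    # The skeptic and chair run afterwards and receive the positions as input.
    transcript = "\n".join(f"{p.role}: {p.text}" for p in positions)
    skeptic = run_agent("skeptic", f"Challenge every claim below.\n{transcript}")
    chair = run_agent("chair", f"Summarize neutrally.\n{transcript}\n{skeptic}")
    return positions, skeptic, chair


positions, skeptic, chair = run_panel("Tenant-wide CMK is not in place.")
for p in positions:
    print(p.role, "->", p.text)
print(skeptic)
print(chair)
```

Swapping the stub for a real model call would not change the control flow, which is the part that makes a run repeatable.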

The workflow

When running the workflow, the idea is to provide a risk statement to the panel. As with all LLMs, the more context, the better. I provide a scenario, describing the system, then provide a security finding and recommendation. It’s like an everyday scenario, where a security guy walks into a room with the product team and goes over findings. I gave it something like below (shortened).

Scenario: a cloud-native, multi-tenant SaaS product generates customer reports with regulated PII and stores these reports in a cloud storage container.

Risk Question: A control requiring tenant-wide customer-managed keys (CMK) is not in place. Should we block the product team from going live with platform-managed keys, or risk-accept it?

I like this question because I find the use of CMK contentious. It is one of those items where so many people feel it is over-engineering security, given that we are encrypting encrypted data. But everyone typically agrees that a regulatory body will not accept that a “platform” holds your encryption key and is a party to the regulated data.

The workflow ran and the outcome of the Agent debate was entertaining.


Security.  Block.  

Rationale: One key compromise = full tenant exposure in a regulated industry. At least block until the product team commits to an implementation date.

Trade-off: Understands CMK is ops work, but won't back down due to regulated data.

Business. Accept.

Rationale: Revenue is blocking on go-live.  Feels this is a LOW risk as platform managed key still encrypts data at rest.  Referenced a compensating control of customer having a contract in place, IP-allowlisting, per-customer tokens and 24h data purge.

Trade-off: Delayed revenue for a quarter, acknowledged case relies on allow-listing to be implemented per customer as contract states.

Architecture. Partial. Tie-break.

Rationale: Agrees with Security on the risk, but also agrees with Business that tenant-wide CMK is not feasible in the time period. Proposes store-specific CMK as a doable alternative.

Trade-off: PMK acceptance today will be costly later down the line with full CMK.

I love this.

It is like playing The Sims, watching agents argue with each other about my typical security life. Except I have zero stakes in the discussion.

Seeing the arguments reflected in appropriate standpoints shows me how I can use LLMs to help iterate on decisions and be more open minded in my analysis. There’s more and this is where the role of the skeptic and chair come in.

Skeptic.  Refute the lowering of risk.

Rationale: States that there is no proof that the allow-list is in place.  The product agent stating that the risk is lowered, as a compensating control, is unproven.  The panel needs to demonstrate the allow-list is in place with configuration evidence.

Chair.

Surfaced: the allow-list is a contentious issue and will change the risk between a medium vs high.  It should be confirmed.

Proposal: Recommend CMK on the storage today, with full CMK later. Architecture's proposal is a good middle ground.

I like how the skeptic played its role in calling out a common scenario where teams say “Well, we have this in place.” and security replies “Show me. I didn’t see that in my data.” The chair also played its role by not taking a stance, instead giving that “management view” of analyzing the whole debate and saying, well, this makes sense, this is the pragmatic recommendation.

The workflow generated the table below as a summary. A nice touch was that it generated a new security requirement too:

Summary	Do not accept PMK as-is. Take Architecture’s middle path: enable CMK on the export store now, and defer only per-tenant key isolation. The risk is High today because the likelihood-reducing controls Business proposed are unverified.
Rating	High – likelihood Medium x impact High. Would drop to Medium only if the IP allow-list enforcement is confirmed.
Recommended decision	Mitigate by compensating control (CMK-on-store + SAS + 24h purge) now; defer per-tenant key isolation to next quarter.
REQ-NEW	Export store must enforce IP allow-listing.

I should add that I added a risk rating rubric to ensure that risks were following a standard and deterministic method. The Agent Workflow is programmed to use this rubric when evaluating risk across all sub-agents.
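
For illustration, a rubric of this kind can be as small as a lookup table. The sketch below is not the exact rubric from the workflow. The matrix values are assumptions, chosen only so that they agree with the two ratings in the summary above (likelihood Medium x impact High gives High, and the rating drops to Medium if the likelihood falls to Low).

```python
LEVELS = ("Low", "Medium", "High")

# Hypothetical likelihood x impact matrix (outer key: likelihood, inner: impact).
MATRIX = {
    "Low":    {"Low": "Low",    "Medium": "Low",    "High": "Medium"},
    "Medium": {"Low": "Low",    "Medium": "Medium", "High": "High"},
    "High":   {"Low": "Medium", "Medium": "High",   "High": "High"},
}


def rate(likelihood: str, impact: str) -> str:
    """Return the rating for a likelihood/impact pair, rejecting unknown levels."""
    if likelihood not in LEVELS or impact not in LEVELS:
        raise ValueError(f"levels must be one of {LEVELS}")
    return MATRIX[likelihood][impact]


# Allow-list unverified: likelihood stays Medium.
print(rate("Medium", "High"))
# Allow-list confirmed (assumed here to lower likelihood to Low).
print(rate("Low", "High"))
```

Because every sub-agent calls the same function, two agents given the same likelihood and impact cannot disagree on the label. They can only disagree on the inputs, which is the debate you actually want.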

Closing Thoughts

Long post. I really enjoyed this experiment. There is a lot more that can be done here too. We could have additional agents that go out and search for real-world examples of security issues being debated, CVEs, more personas, etc.

Finally, I should add that while this was an experiment for Claude Workflows, Agent Teams can also execute the same task. In fact, my workflow allows for both a “Quick” mode using Agent Teams and a “Full” mode using Workflows.

Agent teams will use:

- shared context
- parallel agents
- repeatable on a best-effort basis
- recommended output formats
- cheaper

Workflows will use:

- separate agent contexts
- structural independence
- deterministic, auditable, repeatable
- dedicated agent pass by the skeptic
- outputs enforced by schema
- expensive, as it will spawn multiple agents + overhead orchestration agents
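
As a rough idea of what "outputs enforced by schema" buys you, here is a minimal sketch that assumes an agent hands back JSON. The field names and allowed stances are hypothetical, loosely based on the debate output above. This is not how Workflows implements schemas; it only shows the principle of rejecting anything that does not fit.

```python
import json

# Hypothetical schema for one panelist's answer.
REQUIRED_FIELDS = ("role", "stance", "rationale", "trade_off")
ALLOWED_STANCES = {"Block", "Accept", "Partial"}


def parse_position(raw: str) -> dict:
    """Parse an agent reply and fail loudly if it does not match the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["stance"] not in ALLOWED_STANCES:
        raise ValueError(f"unexpected stance: {data['stance']!r}")
    return data


good = json.dumps({
    "role": "Security",
    "stance": "Block",
    "rationale": "One key compromise exposes the full tenant.",
    "trade_off": "CMK is ops work.",
})
print(parse_position(good)["stance"])
```

With a recommended format, a reply that drifts still gets through. With a check like this, it gets rejected and can be retried.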

The auditable part of Claude Workflows is unique. Workflows give you a structured run journal that shows exactly what was done and when, whereas agent teams only provide a conversation.

Workflows are more costly. My experimenting on Opus 4.8 has shown about 35% cost overhead, but depending on how important a deterministic, agent-orchestrated run is to you, that may be acceptable.

July 14, 2026

On-device LLMs in iOS 27
Apple posted a nice WWDC demo of how local LLMs will work on iOS27.

I’ve been building a lot of iOS apps, replacing all the main apps with ones that are super personalized. Apps I have full control over. One of them is a news aggregator super-app. I was growing tired of opening each news site’s app, and while it’s nice to get AI generated morning summaries of news, it was inflexible compared to what a native app can do – think video, audio, article refinements.

I wanted my app to surface articles and topics that resonated with me and I knew that required a model to run calculations across topics of interest. While I landed on using NLEmbeddings to link up my interactions with my interests, embeddings are simple and vector based. I kept thinking how much more I could do with the benefits of a local LLM.
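
The embedding approach boils down to comparing vectors. Here is a rough sketch of the idea, in Python rather than Swift, with made-up three-dimensional vectors standing in for real embeddings. The titles and numbers are purely illustrative; real embeddings have hundreds of dimensions and would come from the embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical "interest" vector, e.g. an average of articles I lingered on.
interest = [0.9, 0.1, 0.0]

# Hypothetical article embeddings.
articles = {
    "chip export rules": [0.8, 0.2, 0.1],
    "playoff recap": [0.0, 0.1, 0.9],
    "new laptop review": [0.6, 0.5, 0.1],
}

# Rank articles by how closely they point in the same direction as my interests.
ranked = sorted(articles, key=lambda t: cosine(interest, articles[t]), reverse=True)
print(ranked)
```

This is all a vector score can tell you: how close two things are. It cannot explain why, rewrite a headline, or summarize, which is where a local LLM becomes interesting.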

As I was searching on new possibilities, this demo video from WWDC26 came up on YouTube. It is a language learning app. Point your camera at an object, get a Mandarin vocab card generated entirely on device. On device, open-source models are doing the LLM work. SAM 3 for image segmentation, Qwen 0.6B for language generation.

The video answered a lot of my curiosity for how local models would work in iOS27. I am pleasantly surprised:
- No server, no API costs, no data leaving the device. Private by default, which is the straight up benefit of being local. But it’s cool that I can download an entire edge model on the device.
- Ahead-of-time compilation. Pre-compile on your dev machine so that users are not waiting for model loads. If you’ve run a local model on Mac, you know the pain of waiting for a model to load into memory.
- Background asset delivery. Models download only if the user opts into the feature. This is obvious given the space that models take up, but it is good knowing the SDK acknowledges it.
- Same code runs on Mac with a larger model. Swap Qwen 0.6B for 8B and you get better reasoning, longer context, etc. Swapping models based on system capability is another nice touch.
I’m excited to see the SDK for iOS27 and building for it in the near term. And maybe it’s just me, but I’m blown away seeing Qwen mentioned in a live demo, on an iPhone, from Apple.
June 26, 2026
You token maxxin’ too?
It is unsurprising to see organizations scrambling to control AI costs. I am spoiled by subsidized tokens. On Claude I select Opus 4.8, set thinking to MAX, and run my prompt … because I can. On Claude Max or GPT 5.5 Pro, the limits are so generous you feel like by not doing it, you are wasting quota. 10x FTW.

In the current token economy I do not see a rationale to downgrade my model for a prompt to save tokens. Default and go. But times are changing. AI companies are pushing organizations into usage-based billing. GitHub Copilot changed its policy last month, and people were pissed. Makes sense. You started at $50-100 a month with subsidized usage. The subsidy is gone, and your bill is $2000-$5000.

Will AI companies push everyone to usage-based billing? Unlikely, but it is possible that compute becomes noticeably less generous, similar to what we get with Claude Pro vs Max. To save on cost, will people want to invest in local LLMs? Microsoft is reportedly looking into DeepSeek to reduce token costs. I think what’s next is a few things.
1. We get better with token optimization. Select a GPT 4, when 5.5 isn’t required. Start pinching pennies on our prompts and use the dials when appropriate.
2. Local models for appropriate tasks. While local models that code need beefy systems, professional office work can be serviceable with models < 16B parameters. My experience is that context switching models is not intuitive.
3. Maybe this is solved by a system like OpenRouter where all models are accessible and can be programmatically chosen depending on the task. Because to expect me to hit a drop-down select every time I prompt or mid-context, is just not happening.
4. Frontier models keep getting better, to the point that older models become more affordable and “fine” for everyday coding. The problem here is that compute is limited. Looking at Claude token costs over the past year, the cost per token has not decreased for older Sonnet or Opus versions, though there is a notable cost difference between models. Compute is compute. The premium is on the model and the effort level, so affordability comes from choosing a different product, not from compute getting cheaper.
5. Local models for everything. Unlikely with the current token economics of memory. Increasing unified memory from 16 GB to 48 GB on a MBP puts a $2K laptop at $4K. This is the bare minimum needed to run a 32B parameter model. And you will likely still swap. 64 GB MBP forces you into Ultra territory at $5K. The trade-off is real for businesses where a $5K laptop may save under usage based billing scenarios.
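
The routing idea in item 3 could look something like the sketch below. The model names and task categories are placeholders I made up, not real product identifiers and not an OpenRouter API. The point is only that the choice is made by a rule in code instead of by me hitting a drop-down.

```python
# Hypothetical task-to-model routing table; every name here is a placeholder.
ROUTES = {
    "summarize": "small-local-model",
    "office": "small-local-model",
    "code": "mid-tier-model",
    "architecture": "frontier-model",
}

# Fallback for task types the table does not know about.
DEFAULT_MODEL = "mid-tier-model"


def pick_model(task_type: str) -> str:
    """Return the model assigned to a task type, or the default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


for task in ("summarize", "architecture", "translate"):
    print(task, "->", pick_model(task))
```

In practice the hard part is classifying the task in the first place, which this sketch skips entirely.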
In the consumer world, frontier AI companies hungry for fresh data are fighting for users. The cost subsidies will continue. There is talk of a price war now that models are mostly commodities. Google kicked it off by making Gemini AI Pro half the cost while giving a lot of Google Premium Perks. It made me take a hard look at my setup. But the Claude harness is too good right now.

If I were forced into usage-based billing, I would have to look at my setup and make hard decisions about how to optimize. I would likely start by switching frontier companies. I often need to go over to GPT or Gemini when I hit my usage limit on Claude Max. And for most of what I do, they are fairly interchangeable, because they’ve all copied Claude’s harness, which improves user choice.
June 19, 2026
Antigravity 2.0

Google just dropped Antigravity 2.0 and ripped out the VSCode fork. I had been using AG since its early preview because it had extremely generous free Gemini caps within the IDE, all with a VSCode interface. With 2.0, it seems, they just decided to be Claude Code with a Gemini wrapper.

So 2.0 = Claude Code. It has scheduled tasks, connectors, MCP, projects and a primarily chat interface. Guess Google decided Anthropic won, and with Codex doing the same, they couldn’t fall behind. They put a CLI behind it and added projects, worktrees, everything in Claude Desktop. And they moved the IDE into another product.

It was inevitable because I think Anthropic proved this interface is the right interface for most people. VSCode was never going to be the UX that my wife uses. Claude Desktop, while having its own confusing UX issues, is an understandable means to connect and interact with an LLM. So here we are. Microsoft is probably next with Copilot. It’s about time. Copilot is a hot mess. So much so that even Microsoft employees do not want to dogfood it.

May 20, 2026
Into the rabbit hole of a rabbit hole
If you are a Claude Code regular you know that using it in terminal is the way to go. What seems intimidating becomes second nature over the desktop app. But unlike the desktop app which organizes your session views nicely, Code has always been at the mercy of your tmux client and terminal sessions. Super easy to start scrambling for screen real estate with all your terminal windows and tabs.

This week, Claude released Agent View for Claude Code. By hitting the left arrow on your keyboard in a Code session, you can get into an overview screen of your Claude agents. From here, you can spawn agents to do simultaneous work. Your agents can start working on features, planning enhancements, building tests, or bug fixing – all at once. Without opening another Claude Code terminal window. It leverages this by spawning multiple git worktrees on the current working folder so that agents do not interfere with each other.

If you’re in an existing Claude Code window, you can even background that session so that it stays alive once you close the terminal. Even crazier, you can have multiple terminal windows open and go from terminal (1), jump into agent view, and then jump into the Claude session from terminal (2).

What manages this, is what Anthropic is calling the Supervisor process. It’s like a motherboard for your agents that remembers state. The whole thing is very freeing as where you once needed to have multiple terminal windows open to work on the same tasks, you now can have just one terminal and agent view.

But can I just say with a chuckle that, I feel terminal was not designed for being blown into a multi-windowed-multi-paned orchestration engine. It’s getting a bit out of hand. I often get lost in what exactly this window is showing me as I fall into a rabbit hole of terminals within terminals.

I have by my count lost my way many times staring at a black terminal window trying to recollect where in the matrix I am. Because I have:
- Claude Code running in every terminal window.
- cmux managing my terminal workspaces where some workspaces have multiple 2-up or 3-up panes.
- Claude agent view allowing me to spawn 3X more sessions in the background hidden from view.
- Claude agent view allowing me to move between different terminal sessions without leaving my terminal window.
- git branches where I have to remember what branch I had pulled and what working folder I’m in because every feature is usually on a new branch
- git worktrees, which the agent view will use because agent view has to use worktrees so as not to conflict in the working folder
- git worktree branches which… same as above
Insane.
May 20, 2026
Non-determinism

There is a weird shift in how you go from a world where coding is a very structured and deterministic thing, to vibe coding. As a coder you might write a script. Every time you run that script the same thing happens. every. time.

When working with LLMs it is easy to forget that this thing is not what you think it is. It has structure like code, it has guardrails like code, it can even write structured code with guardrails! But run it enough times and you realize that it is not always doing the exact same thing.

And I find as a coder, it can be hard to change that way of thinking. I want to think of an LLM like a new abstraction layer over code that works the exact same way.

I have a family dashboard I created with Claude. It has three parts: Schedule, Homework and Meals. Every week on Sunday, it pulls each of our schedules and extracurriculars, extracts my kids’ calendar from their teachers’ handouts, overlays their important events for the week and generates a week view. I also work with my kids on their homework. I generate homework exercises as per the school curriculum and commit progress to memory. I ask Claude to write new homework into a set of daily one-pagers I can easily print out for the kids. Lastly, I get Claude to randomly select favourite recipes from my note repo, extract all the ingredients, and put them back into my note repo so I have a grocery list every Monday. It works great. But it didn’t always.

At first, after a bunch of prompt testing, I had a prompt that would run and generate what I thought was perfect. And next week, something would look off. It could be that the schedule was in rows rather than columns. It could be that their homework would have 5 questions instead of 15. It could be that the recipe cards I ask it to create no longer fit on one line.

My point to all of this is to say that as a coder, I was treating LLMs like code, but they’re kind of like people. Being non-deterministic, every time they do something they may do it slightly differently. That is unless you force some structure.

And so I learned, if you need something the same every single time, and often for output, that is what we want… you have to use a combination of prompting and coding knowledge to force the LLM into a structure. This could be by forcing it to deploy a script (which is deterministic) or having a template with built-in sentinels. Frankly these quirks feel odd but thinking about how we are coding using just our natural language alone, the trade-off seems fair.
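
As a sketch of the sentinel idea: ask the model to wrap each section of its output in markers, then check the response in code before it ever reaches the printer. The marker names and the 15-question rule below are illustrative assumptions, not my actual template.

```python
import re

# Hypothetical section names, one per part of the dashboard.
SECTIONS = ("SCHEDULE", "HOMEWORK", "MEALS")


def extract_sections(text: str) -> dict:
    """Pull out each sentinel-wrapped section, failing if one is missing."""
    found = {}
    for name in SECTIONS:
        match = re.search(rf"<<{name}>>\n(.*?)\n<<END_{name}>>", text, re.DOTALL)
        if match is None:
            raise ValueError(f"missing section: {name}")
        found[name] = match.group(1)
    return found


def check_homework(body: str, expected: int = 15) -> None:
    """Count numbered lines such as '7. ...' and compare to the expected total."""
    questions = [ln for ln in body.splitlines() if re.match(r"\d+\.\s", ln)]
    if len(questions) != expected:
        raise ValueError(f"expected {expected} questions, got {len(questions)}")


# A stand-in for a model response that follows the template.
homework = "\n".join(f"{i}. question" for i in range(1, 16))
sample = "\n".join([
    "<<SCHEDULE>>", "Mon: swimming", "<<END_SCHEDULE>>",
    "<<HOMEWORK>>", homework, "<<END_HOMEWORK>>",
    "<<MEALS>>", "Tacos", "<<END_MEALS>>",
])

sections = extract_sections(sample)
check_homework(sections["HOMEWORK"])
print("output passed the structure checks")
```

When a check fails, the simplest recovery is to re-run the prompt. The model stays non-deterministic, but what gets through the gate does not.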

May 15, 2026
Claude Routines

Since Claude released routines, I’ve had a blast finding ways to automate my code. I have a laundry list of personal projects in my GitHub repository that I work on with Claude, daily. I regularly have 5+ Claude code sessions coding away on random thoughts and prototypes all at once. So when routines released, I wondered how different it would be over scheduled tasks for cowork. The beauty for me is in the tool calls.

I use Claude Cowork to run a daily financial market analysis and refine a daily thesis that’s ready in the morning. Scheduled runs prompting at a given time and day are a part of cowork that I love. Cronjob with more intelligence.

Claude routines can use my GitHub integration to run everything in the cloud without my machine being on. It pulls my project code, uses a managed instance to run everything and pushes it back to Git after it’s done. It operates like a coding partner for me at night.

Just like that, I now have a bunch of Claude Code routines that trigger nightly, review my project codebase for issues, propose enhancements. Every morning I end up with a list of bug fixes and proposed enhancements. I recently changed my routine to straight up pick an enhancement to work on so when I get up the feature is ready for PR. Claude documents the change, performs security and dependency reviews, and summarizes the change every night. My projects are slowly building themselves as I direct projects rather than code them.

May 13, 2026
thrift and tokens (printing press)

Everyone knows token economy is 💰

I came across this new abstraction layer for integrating external tooling into LLMs, called printingpress.dev.

From an API spec, from a website with no public API, from a beloved community fan project – one command prints a token-efficient Go CLI, a Claude Code skill, an OpenClaw skill, and an MCP server. Peter Steinberger showed the way with discrawl and gogcli: a local SQLite mirror beats a remote API call, compound commands beat ten round trips, and an agent-native CLI beats raw HTTP. The press bakes that playbook into every binary it prints. Muscle memory for agents.

It uses custom-compiled CLIs, saving the valuable tokens that are commonly spent on exchanges with MCP, connectors, etc. MCP traffic is heavy. Part of why my exploration into local LLMs stopped was that I realized how much context an MCP exchange takes.

Similar to how exa, tavily MCPs clean up garbage from the web to provide LLM clean search, printingpress goes a step further and forgoes the whole MCP exchange for a CLI interface that runs locally and does all the dirty work more efficiently, saving your tokens.

The beauty of it is there is also a prompt kit that helps generate brand new CLIs from any service. So point it at a service and watch it go.

May 11, 2026
Faster! Faster!
Having played with LLMs for a few years now, I’ve gone through various stages of appreciation for their efficiency.
1. Tell me some jokes..
2. You coded a debugging nightmare.
3. Hey this is kinda neat.
4. Think for me. I’m too lazy to look it up.
5. Spawn five of yourself and wire it up.
It happened so fast. For me, tool use has been the most eye-opening. To see Claude use computer use to review functionality it just implemented, by itself, by literally clicking around the iOS app it just built, is astounding.

I dove into an article on Sherwood News, Test time. It made me think about hiring the best people for tomorrow. Imagine you are looking at candidates. How can one justify hiring someone who has no experience with the potential of LLMs?

Instead of simply talking through strategy, some CMOs, investors, and operators are now being asked to use AI tools live — or during a tight take-home window — to create something in front of interviewers. A number of other firms do the same, while Nicole DeTommaso, a principal at Harlem Capital, says that anecdotally, she’s seen practically every potential candidate looking to join a venture capital firm being asked to show their prowess with AI coding tools.

DeTommaso wrote that one candidate was asked to build an AI agent that could produce automated research about industries within a working week that could reliably brief partners on a sector before they invested. Another needed to use the likes of Claude Code and Codex to vibe code a dashboard to show information about portfolio companies.

“You are not told which tools to use or how to go about it. You are just expected to figure it out,” she wrote. “And increasingly, what you can actually show in an interview matters more than what’s on your resume.”

At an individual contributor level it seems risky to hire someone who would be doing things “the old way”. It’s like signing a flat-footed defenceman in the world of Cale Makars. Speed is the game now. And at a leader level, arguably it applies too, since the best managers should excel at delegating to LLMs. It’s easier than ever to test and prototype, at a fraction of the cost before AI.
May 7, 2026