James Jung

AI Today: Still Missing the Last 10%


There’s an old rule in software engineering: the first 90% of the work takes 90% of the time. The last 10% takes the other 90%. If you’ve shipped software, you’ve lived this. The feature works in the demo. Then you hit edge cases, permissions issues, state management you didn’t anticipate. The thing that looked done isn’t even close.

I’ve been thinking about this a lot lately with AI agents, specifically the difference between what Claude can do in a scheduled task and what it can do when I dispatch work from my phone.

What’s working is genuinely amazing! Every morning at 7am, a daily brief gets written and lands in my Obsidian vault: a curated summary of my feeds, organized by topic, ready when I wake up. It scans my inbox, pulls my notes from yesterday, and surfaces tasks I need to act on. At noon, my inbox gets processed again. I can drop a note from my iPhone, and by the time I’m at my desk it’s been classified, linked, and filed. I published a blog post from my phone this week. Notes aren’t just filed by the rules and tags I set up; I can drop in completely unstructured things and, as if by magic, a brain puts everything in its place.

But there are still gaps, especially around mobile. I want to take notes anywhere, on any device, and then ask my desktop to execute my co-work tasks, the same things I would do from my desktop or laptop environment. You would think something like Dispatch, which interacts with my desktop, would let me do exactly that. But today it can’t, because it runs in a sandbox.

The sandbox is necessary, but it creates complications that normal users wouldn’t tolerate

Claude Dispatch runs in an isolated Linux VM. It can read and write to my vault, run scripts, fetch web results, and talk to APIs I’ve explicitly connected: the things I have given Cowork the ability to work on. What it can’t do: access my SSH keys, push to git over SSH, force-sync iCloud files with brctl, or reach network resources that require native Mac credentials. When I tried to automate publishing my blog, which involves a git push to a remote, the VM simply didn’t have what it needed. No SSH agent. No keychain. Blocked at the network boundary.

We worked around it. There’s now a post-commit hook that runs natively on my Mac and handles the push after Claude commits. It works. But “it works via workaround” is not the same as “it just works.”
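The hook itself is short. Here’s a sketch of the kind of thing that works; the branch name and log path are my assumptions, adapt them to your setup:

```shell
#!/bin/sh
# .git/hooks/post-commit -- runs natively on the Mac, outside the sandbox,
# so the SSH agent and keychain are available for the push.

# Only push the publishing branch (assumed to be "main").
branch=$(git rev-parse --abbrev-ref HEAD)
if [ "$branch" = "main" ]; then
    # Claude commits inside the VM; the host handles the network hop.
    git push origin main >> "$HOME/.blog-push.log" 2>&1
fi
```

One gotcha: git silently skips hooks that aren’t executable, so you need a `chmod +x .git/hooks/post-commit` after creating it.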

And building this would be 100% out of the realm of possibility for the non-techy people in my life. Would I hand my non-technical colleagues this product and expect them to make it work? Knowing they would have to figure out how to build hooks for Claude or Claude Code? Or write a cron job that force-copies their Obsidian vault’s files just to get iCloud to sync them around?
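For the record, that cron workaround looks something like the entry below. The vault path is an assumption (check where your vault actually lives under ~/Library/Mobile Documents); brctl is macOS’s CloudDocs control tool, and `brctl download` asks iCloud to materialize files under a path.

```shell
# Crontab entry: every 10 minutes, nudge iCloud to pull the vault down.
# Vault path below is assumed -- adjust to your own iCloud layout.
*/10 * * * * /usr/bin/brctl download "$HOME/Library/Mobile Documents/iCloud~md~obsidian/Documents" > /dev/null 2>&1
```

Exactly the kind of thing no non-technical user should ever have to write.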

The answer is clearly no. The sandbox is awesome for security, but horrible for usability. It guarantees that I have explicitly granted access, but my non-technical folks would never have persisted long enough to make this work. It’s the same as my annual “Can I make my iPad work as my only computer?” experiment. The answer, it turns out, is that I can’t. And the journey isn’t one my friends or coworkers would take: it’s painful, and it’s just plain easier to use a laptop.

The sandbox is probably the right call. For now.

Think about what you’re actually giving an AI agent access to when you let it run on your machine. If that agent is compromised, misconfigured, or just wrong about something, you want a blast radius that at least has a limit. You don’t want it to have your SSH keys, your SSN or your credit cards. You don’t want it touching files outside its lane. The principle of least privilege isn’t just a security checkbox. A Claude that can do anything is also a Claude that can do anything wrong.

So the sandboxing isn’t a bug. It’s a deliberate tradeoff. You give up some reach in exchange for trust boundaries. The cost is that operations needing native credentials or OS-level access still require a human in the loop. For now.

But it is a barrier for non-technical people. They want the AI to have all the context, but they don’t want to risk it holding their SSN, credit cards, or anything else that could widen the blast radius. None of the current AI companies have the kind of institutional trust that makes people hand over the keys without thinking twice. If Apple shipped an AI with safety built around it, people would trust it; Apple has spent twenty years building a reputation around privacy and user control, and that kind of earned trust doesn’t come with a product launch. In its current form, even I don’t trust it.

NVIDIA is looking at a way to enable this with NemoClaw, but these are early days. A security and policy engine is just one piece of a layered security model, and that is how we should be securing agents: defending against social engineering aimed at agents, and against hidden code that causes problems. These detections will come, but only over time, as we discover the next set of major security vulnerabilities.

This is the last 10%

AI getting to “useful” was the first 90%. That part happened fast and surprised everyone, including people who’d been working on it for years. The models are amazing and the reasoning is solid. The breadth of what they can handle has crossed a threshold: they’ve stopped being toys and started surprising most people. And moving from chat to agent is a completely different thing.

But “fully autonomous, trusted for the last mile” is a different paradigm: it is about trust infrastructure, not model capability. The models are already capable enough for most tasks, and likely smarter than us. The bottleneck isn’t intelligence, it’s trust infrastructure: authentication, sandboxing, permissions, audit. How do you let an agent push to production without giving it your root password? How do you let it send email on your behalf without worrying it’ll send the wrong message, or the wrong details, to the person you’re communicating with? How do you know what it did and why? The Google MCP integration lets it read your email and draft actions, but you as a human still need to go to drafts and decide to send.

These are hard problems. Not hard in the “needs more research” sense, but hard in the “requires careful engineering, broad adoption, and probably a few high-profile incidents to get right” sense. Like HTTPS. Like OAuth. Like every piece of security infrastructure that felt like overkill until it didn’t.

OpenClaw vs. NemoClaw, and the whole “Corporate AI” paradigm

OpenClaw launched in late 2025, built by Austrian developer Peter Steinberger. It started life as Clawdbot, got renamed, and reached 247,000 GitHub stars by March 2026. The concept: an open-source AI agent that runs on your machine, connects through Telegram, Slack, or WhatsApp, and can actually execute things. Shell commands, email, file operations, browser automation. One of its own maintainers warned that the project is “far too dangerous” for anyone who doesn’t understand the command line. That’s not a bug report. That’s the product description.

NVIDIA saw this and built NemoClaw on top of it. (You may have seen it written as NeroClaw, which is close.) NemoClaw adds a policy engine and network firewall through a security runtime called OpenShell. The agent starts with zero permissions. Every network request gets blocked and surfaced for explicit approval. The policy engine evaluates actions at the binary, destination, method, and path level before anything runs.
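I haven’t seen NemoClaw’s actual policy format, so the names below are hypothetical, but a deny-by-default evaluator over (binary, method, destination, path) tuples, the shape of check described above, might look like this sketch:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Action:
    binary: str       # e.g. "curl"
    method: str       # e.g. "GET"
    destination: str  # e.g. "api.github.com"
    path: str         # e.g. "/user"

@dataclass
class PolicyEngine:
    # Deny-by-default: the agent starts with zero permissions.
    allowed: set = field(default_factory=set)

    def approve(self, action: Action) -> None:
        # A human explicitly approves a surfaced request once...
        self.allowed.add(action)

    def evaluate(self, action: Action) -> bool:
        # ...and afterwards only exact, previously approved tuples pass.
        return action in self.allowed
```

The point of the design is the default: an unapproved request isn’t rate-limited or logged-and-allowed, it simply doesn’t run until a human says so.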

It is freaky that this product needed to exist. We built a capable, widely-adopted AI agent framework, and then NVIDIA had to ship a separate product to put guardrails around it before anyone could responsibly run it in production. That is not a sign of technology being too immature. It is a sign of the trust infrastructure not being there yet. We have the car. We are still building the seatbelts.

What this means for the jobs conversation

The “AI will replace jobs” debate usually gets stuck here, if people are paying attention; judging by all the layoffs in the industry, they clearly aren’t. The roles that are most exposed aren’t the ones that require creativity or judgment; those are already being disrupted. The roles that are safest are the ones where the last 10% is the whole product. A lawyer can’t be right 90% of the time. A surgeon can’t. An accountant filing your taxes can’t.

AI won’t fully replace those roles until an agent can operate with the same accountability structures we expect from a human. That’s a governance, systems engineering, and social engineering problem more than a model capability problem. And it’s genuinely unsolved to date; otherwise we would have more autonomous cars. Are you telling me you trust something to code a system that holds medical records, but don’t trust it to drive around?

We’re closer than we were. The fact that I have a working daily brief, automated inbox processing, and a publishing workflow that mostly runs itself, on a timeline I couldn’t have predicted two years ago, tells me the curve is steep. But the last 10% is going to take a while.

The bar isn’t human-level

Here’s something that gets glossed over: for AI to genuinely handle the last 10%, it can’t just be as good as a human. It has to be measurably better.

Think about self-driving cars. The goal was never “drive as well as an average human.” Average human drivers get distracted, run red lights, fall asleep. That bar is too low. Autonomous vehicles need to be so much better, so consistently, that you can extend trust without constant supervision. AI agents face the same challenge. A human who occasionally forgets a git credential or misconfigures an environment variable is frustrating but forgivable. An AI that does the same thing silently, at scale, across every user’s system is a different category of problem entirely.

This is part of why I’m skeptical of two things that get oversold right now.

The first is large context windows. There’s a temptation to believe that if you just give an AI enough context, it can manage a complex system end-to-end. But context isn’t memory, and window size isn’t understanding. When work spans many sessions, files, repos, and tools, the AI is holding a snapshot, not a model of the system. Cross-cutting concerns like auth, state, and side effects don’t collapse just because you have a million tokens. And most of the models, at least today, get fairly delusional past 750k tokens anyway. One side note on projects: the documents you upload to them get tokenized up front and don’t cost you nearly as much when you use them. A handy trick for managing token usage. 😊

The second is what I’ve started calling the 50 First Dates problem. Each session starts fresh. Most AI agents don’t remember last Tuesday’s conversation, the workaround we agreed on, or why a particular decision was made. Every session I have to re-establish context: what we’re doing, what the constraints are, what went wrong last time. Session summaries and CLAUDE.md files help. Systems like projects in Claude Code, mempalace, and PAI can help bridge the gap, and I use them. But they’re a patch over a fundamental gap. Real trust in an AI system requires persistent, reliable memory, not just a long enough note passed between strangers.
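The pattern those patches implement is simple enough to sketch. Assuming a flat append-only file is enough (file name and record shape are my inventions, not how any of those tools actually store things), it amounts to: persist a summary when a session ends, reload the recent ones when the next begins.

```python
import json
from datetime import date
from pathlib import Path

# Assumed location for the memory file; one JSON record per session.
MEMORY = Path("session-memory.jsonl")

def remember(summary: str, decisions: list[str]) -> None:
    """Append this session's summary so the next session can reload it."""
    record = {
        "date": date.today().isoformat(),
        "summary": summary,
        "decisions": decisions,
    }
    with MEMORY.open("a") as f:
        f.write(json.dumps(record) + "\n")

def recall(last_n: int = 5) -> list[dict]:
    """Load the most recent session records to seed a fresh context window."""
    if not MEMORY.exists():
        return []
    lines = MEMORY.read_text().splitlines()
    return [json.loads(line) for line in lines[-last_n:]]
```

It works, but notice what it is: a note passed forward, not memory. The agent still has to be told to read it, and nothing guarantees the note captured the decision that matters.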

None of this means the trajectory is wrong. It means the last 10% is doing real work, and it is likely farther out than we think. We will see how well this post ages. But the speed of innovation in this space far outpaces hardware and Moore’s law, because software can evolve so much faster.

