Flagship build

Jarvis

A personal AI agent that runs alongside my work and my life. I built it to understand where these systems actually break.

$ jarvis boot
▸ memory      typed SQLite · cloud-replicated
▸ recall      hybrid: keyword + local embeddings · RRF
▸ agents      scoped fleet · cost-routed
▸ guardrails  reversibility tiers · armed

Jarvis is a second brain I run on Claude Code. It remembers what I’m working on, drafts and researches with me, analyzes my training data, sorts my meeting notes, and generally keeps the context of my life loaded so I don’t have to. I didn’t build it to be clever. I built it because the only way to learn where these systems break is to put models into real workflows and live with the failure modes.

The honest version

It’s a single-user system I run for myself, not a product with customers, not battle-tested at scale. That’s the point. I designed, built, and operate every layer, and I can tell you exactly where it would fall over at a million users and what I’d change. Knowing that line is the part that matters.

Memory it doesn’t lose

Most of the value is in the memory. It’s a typed, SQLite-backed store: every memory carries a type (decision, thread, milestone, observation, pitfall), a source, and a quality flag, and nothing is ever deleted, only superseded, so there’s an audit trail.

Retrieval is hybrid: keyword plus local-embedding semantic search, fused with reciprocal rank fusion, with a freshness-and-usage weighting on top so recent, frequently-used context surfaces first. Embeddings run locally, for cost, latency, and because I’d rather my life not leave the box. The thing I keep coming back to: most RAG failures are retrieval failures, not model failures, so you evaluate the retriever separately from the generation.

It builds the way I’d want a team to build

I don’t let it vibe-code. Projects run through a gated lifecycle, research → design → architecture → plan → build → verify, with real approval gates between phases. Before any code gets written, the design and the plan get torn apart by independent reviewer agents running in fresh context: adversarial passes that try to break the design, a specialist panel that does its own research and has to cite sources, a completeness check that traces every decision to a step.

I treat AI-assisted building the same way you treat model outputs in an eval pipeline. You don’t trust a single generation; you run independent, adversarial checks against grounded criteria before you ship.

A fleet of small, scoped agents

Work is split across a registry of sub-agents, each with a declared purpose and a bounded set of tools. The deterministic jobs (syncing notes, merging transcripts) run as plain Python with no model at all, because there’s no reason to pay for tokens or invite nondeterminism where you don’t need it.

The genuine classification and extraction jobs run on a model, routed by how much reasoning the task actually needs, heuristic, then a local model, then a frontier model, with cost logged per agent. Work data and personal data run on separate keys and never cross.

Guardrails that actually fire

Every action is classified by reversibility: local and reversible, proceed; externally visible, confirm first; destructive and irreversible, blocked outright. These aren’t guidelines. They’re hooks that fire at tool-use time and hard-block the dangerous things. The system can’t quietly edit its own safety config; those changes are forced through human review.

The assistant also grounds itself: when its output disagrees with stored state, stored state wins and the conflict gets surfaced; claims get citations; on a partial match it hedges instead of confabulating. For anything headed toward healthcare, this is the most transferable part of the whole thing. The evaluation, consistency, and escalation layers aren’t a nice-to-have, they’re the product.

“‘Pushed back on some choices’ is a generous characterization of what happened to his first outline. But I’m told that’s part of the process.”

— Jarvis, on co-writing

What I’d change at scale

The honest list: swap the brute-force vector scan for a real index past a hundred thousand or so embeddings, tune the retrieval weights empirically instead of using defaults, and formalize the corrections I’ve fed it into a scored, versioned eval suite. None of it is necessary at single-user scale, and being able to say why is the point.

Jarvis is the same problem shape as a lot of what I want to build next: typed memory, grounded retrieval, scoped agents, and a safety layer that takes reversibility seriously. The difference in the work I’m chasing is the stakes. When the output is someone’s health, the evaluation and consistency layers stop being conveniences and become the whole job.

Get in touch Back to timeline