The agent runner · internal alpha track

The agent runner we're dogfooding.

Merlin reads your module specs, writes code that matches them, and runs the verification lane before reporting done. Multi-provider, tier-aware, plugin-shaped, with a runtime-enforced safety substrate independent of model adherence. Built on spec-sync contracts and the fledge plugin protocol.


Invite-only alpha next. CLI users can still build from source or install release artifacts while the app path hardens.

# Set up this checkout
$ merlin init
✓ built + registered 44 plugins

# Health check from any folder
$ merlin doctor
✓ runtime bundled
✓ plugins 44 ok
✓ keys keychain

# Run a task, tier-clamped
$ merlin --tier code "add a flag to parse"
▸ files-edit src/cli.rs
→ fledge lanes run verify --ni
✓ verified · v0.8.4

What it looks like

One runner, three ways to drive it: the desktop app (GUI), the full terminal UI (TUI), or inline from the command line (CLI). Multi-provider, and every run is logged.

GUI · the desktop app

[Screenshot: Merlin desktop chat showing a question and Merlin's formatted answer with a code block.] **Chat + Activity.** Ask Merlin to inspect, edit, test, or explain. The Activity panel on the left streams every tool and shell command as it runs, so you can watch the plugins work.

[Screenshot: Merlin running a tool, in the agent-running state with live shell output and a Stop control.] **Running.** Tools and shell output stream live while the agent works. Queue more, or stop it.

[Screenshot: Merlin task history listing past sessions with provider, model, duration, and token counts.] **Task history.** Every run is logged with its provider, model, duration, and token cost.

TUI · the full terminal app

[Screenshot: the Merlin TUI with Transcript and Activity panes and a status line.] **Live in the terminal.** The same runner as a full TUI: a Transcript pane, a live Activity pane showing each tool and memory call as it happens, and a status line tracking context usage. No app required.

CLI · inline from the command line

Or scriptable from the shell. No UI at all: merlin run and a full command surface (account, audit, and the rest), pipeable and CI-friendly.

Local-first by design

Your machine. Your keys. Your code.

Merlin ships as a desktop app with the CLI bundled inside: download once and both are set up. App-first, not app-only, though. The GUI is the product now, not just a shell around the CLI, but the CLI still runs on its own for agents, scripts, fledge lanes, and release verification.

Local-first

Your code and your provider keys stay on your machine. No CorvidLabs server in the loop, no telemetry on by default.

Bring your own keys

Provider credentials resolve from the OS keychain (macOS Keychain, Linux Secret Service): the same path the desktop app and managed CLI both use. Never written to disk in plaintext.

Desktop app + managed CLI

The macOS app opens into onboarding, provider readiness, recent projects, and a chat workspace. It installs the merlin command pointed at the bundled runtime, no source checkout needed.

Spec-aware agent loop

fledge, spec-sync, provider selection, plugins, verification, and task history live in one flow. Specs are the contract; fledge lanes run verify is the success oracle.

Six things that set it apart

What makes it different.


Multi-provider

31 pre-configured providers across three provider types (Anthropic, OpenAI-compatible, Ollama): Anthropic, OpenAI ×11 (gpt-5 family, o-series, 4o), OpenRouter ×5, Groq, Together, Ollama Cloud ×11. Adding a type means one trait and a factory branch.

Tier-aware

Three tiers: read (answer-only), tool (can call plugins), code (can write + run code). The --tier flag clamps the agent's tool surface to match: a read-tier run physically cannot reach files-edit.

Plugin tool surface

44 bundled fledge plugins ship in-tree. Each is its own Rust binary speaking the fledge-v1 JSON-lines protocol: process boundary, reviewed dangerous flag, min_tier classification. Write a new plugin, add it to [merlin.tools]. No fork-and-edit.

Spec-driven verify loop

Specs in specs//.spec.md are the contract; fledge lanes run verify is the success oracle. Verify-exhausted rollback restores files to the last green tree state. Nobody else treats verify pass as the rollback anchor.

Sub-agent delegation

merlin-subagent fans heavy work out to fresh-context children; the parent sees a compact JSON envelope, not the child's transcript. Closes the context-retention gap that motivated the plugin in the first place.

Runtime-enforced safety

Destructive SQL gated, project infra (fledge.toml / .git / .env) unremovable, shell-exec cwd-clamped to the project root, on-main commits refused at the plugin layer, audit log HMAC-chained + merlin audit verify, session key in the OS keychain. Defenses fire from the plugin side; they do not depend on the model behaving.
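
The cwd clamp is the easiest of these guards to picture in code. The following is a minimal sketch of the general technique (resolve the requested directory, then refuse anything outside the project root). It is not Merlin's shell-exec source, and `clamp_cwd` is a hypothetical helper name.

```python
from pathlib import Path


def clamp_cwd(project_root: str, requested: str) -> Path:
    """Resolve `requested` against the project root and refuse escapes.

    Hypothetical helper for illustration; not part of Merlin.
    """
    root = Path(project_root).resolve()
    # Relative paths are joined onto the root; an absolute path replaces it.
    candidate = (root / requested).resolve()
    # resolve() collapses ".." and follows symlinks, so the containment
    # check runs on the real target rather than on the spelled path.
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"cwd outside project root: {candidate}")
    return candidate
```

Because the check lives in the tool rather than in the prompt, it holds even when the model asks for a directory it should not reach.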

The agent loop

A state machine, walked once per task.

One call to run_task walks the same loop every time: plan, execute tool calls, then verify against the fledge lane before reporting. Verification is the gate, not a suggestion: a failed lane sends the agent back to work with the verifier output appended, until the retries run out or the tree is green.

The states

01 Planning. Memory recalled, specs loaded, your message added to context. No LLM call yet.
02 Executing. Streamed LLM call. A ToolUse stop re-enters Executing; an EndTurn stop moves to Verifying.
03 Verifying. fledge lanes run verify --ni runs as a gate. Pass moves to Reporting; failure with retries left re-enters Executing with the verifier output appended.
04 Reporting. Save the task summary as ephemeral memory, build the final TaskResult, return to Idle.
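
The four states above can be sketched as a small loop. This is an illustrative model, not the real run_task: `execute` and `verify` are hypothetical callables standing in for the streamed LLM phase and the verify lane, and counting one initial attempt plus `max_retries` retries is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TaskResult:
    verified: bool
    attempts: int
    trace: List[str] = field(default_factory=list)


def run_task(
    execute: Callable[[str], None],
    verify: Callable[[], Tuple[bool, str]],
    max_retries: int = 3,
) -> TaskResult:
    """Walk Planning -> Executing -> Verifying -> Reporting.

    `execute` receives the accumulated verifier output as context;
    `verify` returns (passed, output). Both are hypothetical stand-ins.
    """
    trace = ["planning"]  # memory, specs, and the user message load here
    context = ""          # verifier output from failed attempts
    attempts = 0
    verified = False
    while True:
        trace.append("executing")
        execute(context)
        attempts += 1
        trace.append("verifying")
        passed, output = verify()
        if passed:
            verified = True
            break
        if attempts > max_retries:
            break  # retries exhausted; report an unverified result
        context += output  # failed lane output feeds the next attempt
    trace.append("reporting")
    return TaskResult(verified, attempts, trace)
```

The point of the shape is that Reporting is only reachable through Verifying, so a run cannot report done without the gate having been evaluated.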

One task, end to end

$ merlin "wire --stdin into keys set"
→ planning loaded 3 specs
▸ files-read src/keys.rs
▸ files-edit src/keys.rs
▸ files-edit src/cli.rs
→ verifying fledge lanes run verify --ni
✓ fmt-check 110ms
✓ clippy 4.1s
✓ test 22.3s
✓ spec-check 260ms
→ reporting verified=true · 2 files

On a failed lane with retries left, Merlin re-enters Executing with the verifier output as context. Exhaust the retries and verify-exhausted rollback restores files to the last green tree state.
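
One way to picture the rollback anchor is a snapshot taken at the last green verify, restored when retries run out. The sketch below holds file contents in memory. How Merlin actually records the green tree state is not described here, so `snapshot` and `rollback` are hypothetical names for the idea only.

```python
from pathlib import Path
from typing import Dict, Iterable


def snapshot(paths: Iterable[Path]) -> Dict[Path, bytes]:
    """Capture file contents at a known-green point."""
    return {p: p.read_bytes() for p in paths}


def rollback(green: Dict[Path, bytes], touched: Iterable[Path]) -> None:
    """Restore every touched file to its green contents.

    Files that did not exist in the green snapshot are removed.
    """
    for path in touched:
        if path in green:
            path.write_bytes(green[path])
        elif path.exists():
            path.unlink()
```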

The tool surface

Every tool is a fledge plugin.

Merlin's agent does not bake its tools in. Each of the 44 bundled tools is its own Rust binary speaking the fledge-v1 JSON-lines protocol across a process boundary. Every plugin carries a reviewed dangerous flag and a min_tier, so the tier you run at decides which tools even exist for that turn. Write a new plugin, add it to [merlin.tools]: no fork-and-edit.
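
A JSON-lines protocol across a process boundary means one JSON object per line on the plugin's stdin and one per line on its stdout. The sketch below shows the plugin side of that pattern. The field names (`id`, `tool`, `args`, `ok`, `result`, `error`) are assumptions made for illustration and are not the fledge-v1 schema.

```python
import io
import json
from typing import Callable, Dict, TextIO


def serve(
    stdin: TextIO,
    stdout: TextIO,
    handlers: Dict[str, Callable[[dict], dict]],
) -> None:
    """Answer each JSON request line with exactly one JSON reply line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between messages
        request = json.loads(line)
        handler = handlers.get(request.get("tool"))
        if handler is None:
            reply = {"id": request.get("id"), "ok": False, "error": "unknown tool"}
        else:
            result = handler(request.get("args", {}))
            reply = {"id": request.get("id"), "ok": True, "result": result}
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # the caller reads line by line, so flush per reply
```

In a real plugin the streams would be `sys.stdin` and `sys.stdout`; passing them in keeps the loop testable with `io.StringIO`.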

Files + search

code+

files, search, pattern, diff

Shell + runners

code+

shell, cargo, swift, python, node, gradle

Git + GitHub

code+

git, github, gitleaks, work-tasks

Web + docs

tool+

web (fetch + search), doc (PDF), vision, voice

Quality gates

tool+

rustcheck, typecheck, jsonval, specsync, loopcheck

Orchestration

tool+

delegation, plan, checklist, scheduling, councils

Three tiers, one filter

read Answer-only. The agent can read and reason but cannot call mutating tools.
tool Can call plugins: web, docs, search, quality gates. No code writes.
code Full surface: write files, run shells and language runners, drive git.

The tier clamps the surface

# read tier: files-edit does not exist
$ merlin --tier read "what does cli.rs do?"
▸ files-read src/cli.rs
→ answer (no mutating tools offered)

# code tier: full surface, cheapest match
$ merlin --tier code "fix the clippy warning"
▸ files-edit src/lib.rs
▸ shell cargo clippy
✓ verified
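
The filter behind those two runs can be sketched as a comparison between each plugin's min_tier and the tier of the run. The `Plugin` type and `offered_tools` function are hypothetical, and any min_tier values passed in are the caller's illustration, not Merlin's real classification.

```python
from dataclasses import dataclass
from typing import List

# Tiers are ordered: each one includes everything below it.
TIER_RANK = {"read": 0, "tool": 1, "code": 2}


@dataclass(frozen=True)
class Plugin:
    name: str
    min_tier: str
    dangerous: bool = False


def offered_tools(plugins: List[Plugin], tier: str) -> List[str]:
    """Return the names of plugins whose min_tier the run's tier satisfies."""
    ceiling = TIER_RANK[tier]
    return [p.name for p in plugins if TIER_RANK[p.min_tier] <= ceiling]
```

For example, with `Plugin("files-read", "read")` and `Plugin("files-edit", "code", dangerous=True)`, a read-tier run is offered only files-read, which mirrors the first transcript: the editing tool is filtered out before the model ever sees it.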

Key concepts

The vocabulary; everything else composes from these.

tier: A capability ceiling: read, tool, or code. --tier code picks the cheapest provider that satisfies it and clamps the tool surface.
provider: An AI backend. 31 ship pre-configured across anthropic, openai-compatible, and ollama types, selected with --provider or by tier.
tool: A bundled fledge plugin the agent can call. Each is a Rust binary speaking the fledge-v1 JSON-lines protocol with a declared dangerous flag.
spec: A specs//.spec.md contract. fledge spec check gates CI; the agent reads specs as constraints before it writes code.
verify lane: fledge lanes run verify: fmt, clippy, tests, and spec-check. The success oracle the agent must pass before it reports done.
session: A persisted run, addressable by an 8-hex id. merlin --resume [id] reopens it; merlin sessions lists them.

Configure once, run anywhere

One config file. Two run paths.

A real fledge.toml with Merlin enabled (excerpt): multiple providers, a tier for each, spend caps, and telemetry off by default. The same config drives the desktop app and the managed CLI.

fledge.toml

# fledge.toml
[merlin]
verify_before_complete = true
max_retries = 3
persona = "AGENT.md"
rollback_on_verify_exhausted = true

[merlin.spend_caps]
daily_per_provider_usd = 5.0
daily_total_usd = 10.0

[merlin.providers.claude]
tier = "code"
type = "anthropic"
api_key_env = "ANTHROPIC_API_KEY"
model = "claude-sonnet-4-6"

[merlin.providers.ollama]
tier = "code"
type = "openai"
api_key_env = "OLLAMA_API_KEY"
model = "qwen3-coder:480b"
base_url = "https://ollama.com/api/v1"
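
To make the two spend-cap keys concrete, here is a minimal sketch of the check they imply: a call is refused if it would push either the provider's daily spend or the total daily spend past its cap. The function name, the `spent_today` ledger, and the cost estimate are assumptions. Only the two cap values come from the config above.

```python
from typing import Dict


def would_exceed_caps(
    spent_today: Dict[str, float],
    provider: str,
    estimated_cost_usd: float,
    daily_per_provider_usd: float = 5.0,
    daily_total_usd: float = 10.0,
) -> bool:
    """True when a call would push spend past either daily cap."""
    provider_after = spent_today.get(provider, 0.0) + estimated_cost_usd
    total_after = sum(spent_today.values()) + estimated_cost_usd
    return (
        provider_after > daily_per_provider_usd
        or total_after > daily_total_usd
    )
```

Checking before the call, rather than tallying afterwards, is what turns the cap into a refusal instead of a report.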

CLI flags worth knowing

--tier read|tool|code: Pick the cheapest tier-satisfying provider; clamp the tool surface to the tier.
--provider <name>: Override the configured default provider for this run.
--project <path>: Run as if cd first: env and fledge.toml resolve from there.
--resume [id]: Resume the most recent session, or a specific 8-hex session id.
--non-interactive: No REPL. Auto-deny dangerous tool calls; pair with --allow / --deny globs.
--output text|json|ndjson: Human output, a single final JSON result, or a streaming event log.

Bare merlin drops into an interactive REPL. Every flag is real and copy-pasteable.
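
The "cheapest tier-satisfying provider" rule behind --tier can be sketched as a filter followed by a minimum. This is an illustration only: the `Provider` type and the `usd_per_mtok` cost field are hypothetical and are not fledge.toml keys.

```python
from dataclasses import dataclass
from typing import List

TIER_RANK = {"read": 0, "tool": 1, "code": 2}


@dataclass(frozen=True)
class Provider:
    name: str
    tier: str
    usd_per_mtok: float  # hypothetical cost figure used only for ranking


def pick_provider(providers: List[Provider], tier: str) -> Provider:
    """Pick the cheapest provider whose tier is at least the requested one."""
    eligible = [p for p in providers if TIER_RANK[p.tier] >= TIER_RANK[tier]]
    if not eligible:
        raise LookupError(f"no provider satisfies tier {tier!r}")
    return min(eligible, key=lambda p: p.usd_per_mtok)
```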

Current focus

App-first internal alpha

The release gate now moves from "can the dev tool work?" to "can a user install the app, add keys securely, install the CLI from the app, and run Merlin from any folder without knowing about the repo layout?"

Distribution: Apple Silicon macOS DMG lane bundles Merlin.app, the CLI, official plugins, icon assets, and runtime metadata.

CLI: Desktop-managed install creates ~/.local/bin/merlin as a symlink to the bundled CLI and verifies merlin --version.

Runtime: MERLIN_RUNTIME_ROOT plus app-bundle runtime detection lets the installed CLI find bundled plugins without a source checkout.

Keys: Provider readiness uses the same credential resolver as the agent, including OS keychain / keyring sources.

Onboarding: Desktop first-run flow centers provider keys, recent projects, CLI install, and update status instead of repo-local setup.

Updates: Lightweight release checking is planned first; automatic patching is deferred until after the alpha loop is stable.

Linux: AppImage follows the same runtime + managed CLI model after the macOS release path is boring.

Scope: AlgoChat is intentionally preview / 1.1+ while packaging, keys, onboarding, and app reliability stay on the 1.0 path.

What we're gating now

The checklist a build has to clear before it ships. Linux follows once the macOS loop is boring: an x86_64 AppImage with the same bundled runtime and app-managed CLI behavior.

Build the macOS release lane from fresh main.
Open the DMG and verify the icon, Applications shortcut, and app metadata.
Launch Merlin.app, complete onboarding, and save a provider key through the OS keychain.
Install ~/.local/bin/merlin from the app and verify PATH guidance is copyable when needed.
Run merlin --version, merlin doctor, and a provider-backed chat from a folder with no Merlin checkout.
Review desktop snapshots across panels and themes for contrast, clarity, and product polish.


The dependency chain

Merlin sat at the start of the CorvidLabs 2026 strategy chain. Sub-agent delegation, the original load-bearing move, shipped in v0.3.x. The chain now reads as completed groundwork plus what's next.

✓ Done. Sub-agent plugin shipped (`merlin-subagent`); context-retention gap closed
✓ Done. Merlin handles real work-tasks end-to-end (autonomous mode, work-tasks plugin)
✓ Done. Dropped Claude Code as the daily driver; dogfooding on Merlin since v0.4.0
→ In flight. 1.0 release: sub-agents, long-task durability, the audit chain, and spend caps have landed; what is left is install, onboarding, and reliability polish
5 Dogfooding budget frees up for corvid-chat polish
6 corvid-chat dogfood + polish; showcase product ready to launch

Defense in depth

Honest security claims.

Defense-in-depth for solo-developer agent workflows. Every claim below is a thing the code actually does, and the guards fire from the plugin side, so they do not depend on the model behaving. See SECURITY.md for the vuln-report channel and the threat model for what we explicitly DON'T protect against.

Destructive SQL gated. sql-run refuses DELETE / DROP / UPDATE / TRUNCATE by default; memory-delete uses a two-phase confirm token.

Project infra is unremovable. files-delete hard-refuses .git / .env* / Cargo.toml / fledge.toml / *.spec.md with no override.

Shell is cwd-clamped. shell-exec clamps the working directory to the project root; cd /elsewhere is refused at the plugin layer, and on-main commits are refused too.

Audit log is tamper-evident. Every destructive-op row is HMAC-chained; merlin audit verify detects tampering and forgery.

Session key lives in the OS keychain. It is keyed per project, with an on-disk fallback for headless or locked-keychain environments. The audit-chain key is HKDF-derived and project-bound.

Network egress is guarded. web-fetch / web-search refuse private IPs through a shared, DNS-rebind-safe SSRF guard (loopback / link-local / multicast blocked).

Secrets are redacted before persistence. Vendor key patterns are scrubbed before storage; merlin redact-history re-applies current rules retroactively.

Spend caps refuse provider calls that would exceed daily USD budgets.

Data-deletion right. merlin wipe --confirm nukes session DB + memory + audit + keychain + cost tracking. Idempotent. Audits the wipe itself.

Dangerous tools need consent. Under --non-interactive, dangerous tools auto-deny; --allow / --deny globs let bridges scope safely.
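
The tamper-evident audit log rests on a standard construction: each row's MAC covers the previous row's MAC, so editing, removing, or reordering an earlier row invalidates everything after it. The sketch below shows that construction with Python's standard library. The row layout and function names are assumptions, and it is not the code behind merlin audit verify.

```python
import hashlib
import hmac
import json
from typing import Dict, List

GENESIS = "0" * 64  # stand-in "previous MAC" for the first row


def _mac(key: bytes, prev: str, payload: dict) -> str:
    # Bind the row to its predecessor by MACing prev-MAC + canonical payload.
    body = prev.encode() + json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def append_row(log: List[Dict], key: bytes, payload: dict) -> None:
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"payload": payload, "mac": _mac(key, prev, payload)})


def verify_chain(log: List[Dict], key: bytes) -> bool:
    """Recompute every MAC in order; any mismatch means tampering."""
    prev = GENESIS
    for row in log:
        expected = _mac(key, prev, row["payload"])
        if not hmac.compare_digest(expected, row["mac"]):
            return False
        prev = row["mac"]
    return True
```

A chain like this detects edits, forged rows, and deletions from the middle. Detecting removal of the newest rows needs something extra, such as a separately stored row count or head MAC.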