The agent runner · internal alpha track
The agent runner we're dogfooding.
Merlin reads your module specs, writes code that matches them, and runs the verification lane before reporting done. Multi-provider, tier-aware, plugin-shaped, with a runtime-enforced safety substrate independent of model adherence. Built on spec-sync contracts and the fledge plugin protocol.
Invite-only alpha next. CLI users can still build from source or install release artifacts while the app path hardens.
$ merlin init
✓ built + registered 44 plugins
# Health check from any folder
$ merlin doctor
✓ runtime bundled
✓ plugins 44 ok
✓ keys keychain
# Run a task, tier-clamped
$ merlin --tier code "add a flag to parse"
▸ files-edit src/cli.rs
→ fledge lanes run verify --ni
✓ verified · v0.8.4
What it looks like
One runner, three ways to drive it: the desktop app (GUI), the full terminal UI (TUI), or inline from the command line (CLI). Multi-provider, and every run is logged.
GUI · the desktop app
TUI · the full terminal app
CLI · inline from the command line
merlin run and a full command surface (account, audit, and the rest), pipeable and CI-friendly.
Local-first by design
Your machine. Your keys. Your code.
Merlin ships as a desktop app with the CLI bundled inside: download once and both are set up. App-first, not app-only, though. The GUI is the product now, not just a shell around the CLI, but the CLI still runs on its own for agents, scripts, fledge lanes, and release verification.
Local-first
Your code and your provider keys stay on your machine. No CorvidLabs server in the loop, no telemetry on by default.
Bring your own keys
Provider credentials resolve from the OS keychain (macOS Keychain, Linux Secret Service): the same path the desktop app and managed CLI both use. Never written to disk in plaintext.
Desktop app + managed CLI
The macOS app opens into onboarding, provider readiness, recent projects, and a chat workspace. It installs the merlin command pointed at the bundled runtime, no source checkout needed.
Spec-aware agent loop
fledge, spec-sync, provider selection, plugins, verification, and task history live in one flow. Specs are the contract; fledge lanes run verify is the success oracle.
Six things that set it apart
What makes it different.
Multi-provider
31 pre-configured providers across three provider types (Anthropic, OpenAI-compatible, Ollama): Anthropic, OpenAI ×11 (gpt-5 family, o-series, 4o), OpenRouter ×5, Groq, Together, Ollama Cloud ×11. Adding a type means one trait and a factory branch.
Tier-aware
Three tiers: read (answer-only), tool (can call plugins), code (can write + run code). --tier code clamps the agent's tool surface to match: a read-tier run physically cannot reach files-edit.
Plugin tool surface
44 bundled fledge plugins ship in-tree. Each is its own Rust binary speaking the fledge-v1 JSON-lines protocol: process boundary, reviewed dangerous flag, min_tier classification. Write a new plugin, add to [merlin.tools]. No fork-and-edit.
Spec-driven verify loop
Specs in specs/ are the contract; fledge lanes run verify is the success oracle. Verify-exhausted rollback restores files to the last green tree state. Nobody else treats verify pass as the rollback anchor.
Sub-agent delegation
merlin-subagent fans heavy work out to fresh-context children; the parent sees a compact JSON envelope, not the child's transcript. Closes the context-retention gap that motivated the plugin in the first place.
Runtime-enforced safety
Destructive SQL gated, project infra (fledge.toml / .git / .env) unremovable, shell-exec cwd-clamped to the project root, on-main commits refused at the plugin layer, audit log HMAC-chained + merlin audit verify, session key in the OS keychain. Defenses fire from the plugin side; they do not depend on the model behaving.
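The HMAC-chained audit log can be pictured with a minimal sketch. This illustrates the general technique and is not Merlin's implementation: the row layout, the GENESIS sentinel, and the names row_mac, append_row, and verify_chain are hypothetical, and the key is a plain byte string rather than one resolved from the OS keychain.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed sentinel standing in for "previous MAC" on the first row


def row_mac(key: bytes, prev_mac: str, payload: dict) -> str:
    """MAC over the previous row's MAC plus this row's canonical JSON."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, prev_mac.encode() + body, hashlib.sha256).hexdigest()


def append_row(key: bytes, log: list, payload: dict) -> None:
    """Append a row whose MAC binds it to everything before it."""
    prev = log[-1]["mac"] if log else GENESIS
    log.append({"payload": payload, "mac": row_mac(key, prev, payload)})


def verify_chain(key: bytes, log: list) -> bool:
    """Recompute every MAC in order; any edited or removed interior row breaks the chain."""
    prev = GENESIS
    for row in log:
        expected = row_mac(key, prev, row["payload"])
        if not hmac.compare_digest(expected, row["mac"]):
            return False
        prev = row["mac"]
    return True
```

Because each MAC covers the previous one, editing or deleting an earlier row invalidates every row after it. A chain like this does not by itself notice rows trimmed from the tail, so a real design would likely also anchor the latest MAC somewhere.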
The agent loop
A state machine, walked once per task.
One call to run_task walks the same loop every time: plan, execute tool calls, then verify against the fledge lane before reporting. Verification is the gate, not a suggestion: a failed lane sends the agent back to work with the verifier output appended, until the retries run out or the tree is green.
The states
- 01 Planning. Memory recalled, specs loaded, your message added to context. No LLM call yet.
- 02 Executing. Streamed LLM call. A ToolUse stop re-enters Executing; an EndTurn stop moves to Verifying.
- 03 Verifying. fledge lanes run verify --ni runs as a gate. Pass moves to Reporting; failure with retries left re-enters Executing with the verifier output appended.
- 04 Reporting. Save the task summary as ephemeral memory, build the final TaskResult, return to Idle.
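The four states above can be sketched as a small loop. This is a toy model under stated assumptions, not the real run_task: the llm_turn and verify callables, the fields on TaskResult, and the reading of max_retries as "extra attempts after the first failed verify" are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Tuple


class State(Enum):
    PLANNING = auto()
    EXECUTING = auto()
    VERIFYING = auto()
    REPORTING = auto()


@dataclass
class TaskResult:
    verified: bool
    verify_runs: int
    trace: List[State] = field(default_factory=list)


def run_task(
    llm_turn: Callable[[List[str]], str],
    verify: Callable[[], Tuple[bool, str]],
    max_retries: int = 3,
) -> TaskResult:
    context: List[str] = []
    trace: List[State] = []
    state = State.PLANNING
    retries_left = max_retries
    verify_runs = 0
    verified = False
    while True:
        trace.append(state)
        if state is State.PLANNING:
            # Memory, specs, and the user message go into context; no LLM call yet.
            context.append("user message + specs + recalled memory")
            state = State.EXECUTING
        elif state is State.EXECUTING:
            stop = llm_turn(context)
            # A ToolUse stop loops back into Executing; EndTurn moves to Verifying.
            state = State.EXECUTING if stop == "ToolUse" else State.VERIFYING
        elif state is State.VERIFYING:
            verify_runs += 1
            ok, output = verify()
            if ok:
                verified = True
                state = State.REPORTING
            elif retries_left > 0:
                retries_left -= 1
                context.append(output)  # verifier output becomes context for the retry
                state = State.EXECUTING
            else:
                state = State.REPORTING  # retries exhausted; report unverified
        else:
            return TaskResult(verified, verify_runs, trace)
```

Passing the model call and the verifier in as callables keeps the loop testable without a provider or a fledge install.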
One task, end to end
→ planning loaded 3 specs
▸ files-read src/keys.rs
▸ files-edit src/keys.rs
▸ files-edit src/cli.rs
→ verifying fledge lanes run verify --ni
✓ fmt-check 110ms
✓ clippy 4.1s
✓ test 22.3s
✓ spec-check 260ms
→ reporting verified=true · 2 files
On a failed lane with retries left, Merlin re-enters Executing with the verifier output as context. Exhaust the retries and verify-exhausted rollback restores files to the last green tree state.
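One way to picture verify-exhausted rollback is a snapshot taken whenever the tree is green and restored when retries run out. The sketch below is an assumption-heavy illustration: GreenSnapshot and its methods are hypothetical names, it tracks an explicit file list in memory, and the actual mechanism Merlin uses is not described here.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


class GreenSnapshot:
    """In-memory copy of tracked files, captured when verification passes."""

    def __init__(self, root: Path, tracked: Iterable[str]) -> None:
        self.root = root
        self.tracked = list(tracked)
        self.saved: Dict[str, Optional[bytes]] = {}

    def capture(self) -> None:
        """Record each tracked file's bytes, or None if it does not exist yet."""
        for rel in self.tracked:
            path = self.root / rel
            self.saved[rel] = path.read_bytes() if path.exists() else None

    def restore(self) -> None:
        """Put every tracked file back to its captured state."""
        for rel, data in self.saved.items():
            path = self.root / rel
            if data is None:
                path.unlink(missing_ok=True)  # file was absent at the green state
            else:
                path.parent.mkdir(parents=True, exist_ok=True)
                path.write_bytes(data)
```

A caller would capture() after each passing verify run and restore() only once retries are exhausted, so a failed task leaves the tree where the last green run left it.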
The tool surface
Every tool is a fledge plugin.
Merlin's agent does not bake its tools in. Each of the 44 bundled tools is its own Rust binary speaking the fledge-v1 JSON-lines protocol across a process boundary. Every plugin carries a reviewed dangerous flag and a min_tier, so the tier you run at decides which tools even exist for that turn. Write a new plugin, add it to [merlin.tools]: no fork-and-edit.
Files + search
code+ · files, search, pattern, diff
Shell + runners
code+ · shell, cargo, swift, python, node, gradle
Git + GitHub
code+ · git, github, gitleaks, work-tasks
Web + docs
tool+ · web (fetch + search), doc (PDF), vision, voice
Quality gates
tool+ · rustcheck, typecheck, jsonval, specsync, loopcheck
Orchestration
tool+ · delegation, plan, checklist, scheduling, councils
Three tiers, one filter
- read Answer-only. The agent can read and reason but cannot call mutating tools.
- tool Can call plugins: web, docs, search, quality gates. No code writes.
- code Full surface: write files, run shells and language runners, drive git.
The tier clamps the surface
$ merlin --tier read "what does cli.rs do?"
▸ files-read src/cli.rs
→ answer (no mutating tools offered)
# code tier: full surface, cheapest match
$ merlin --tier code "fix the clippy warning"
▸ files-edit src/lib.rs
▸ shell cargo clippy
✓ verified
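The clamp shown above amounts to a filter over each plugin's min_tier. Here is a minimal sketch of that idea; the Plugin record, the TIER_RANK ordering, and the min_tier values assigned to each example plugin are assumptions made for illustration, not taken from the real plugin manifests.

```python
from dataclasses import dataclass
from typing import List

# Assumed ordering: each tier includes everything below it.
TIER_RANK = {"read": 0, "tool": 1, "code": 2}


@dataclass(frozen=True)
class Plugin:
    name: str
    min_tier: str
    dangerous: bool = False


# Hypothetical classifications, chosen to match the transcript above.
PLUGINS = [
    Plugin("files-read", "read"),
    Plugin("web-fetch", "tool"),
    Plugin("files-edit", "code", dangerous=True),
    Plugin("shell-exec", "code", dangerous=True),
]


def tool_surface(plugins: List[Plugin], tier: str) -> List[str]:
    """Names of the plugins offered to the model when running at this tier."""
    ceiling = TIER_RANK[tier]
    return [p.name for p in plugins if TIER_RANK[p.min_tier] <= ceiling]
```

Filtering before the model ever sees the tool list is what makes a read-tier run unable to reach files-edit: the tool is never offered, rather than offered and refused.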
Key concepts
The vocabulary; everything else composes from these.
- tier
- A capability ceiling: read, tool, or code. --tier code picks the cheapest provider that satisfies it and clamps the tool surface.
- provider
- An AI backend. 31 ship pre-configured across anthropic, openai-compatible, and ollama types, selected with --provider or by tier.
- tool
- A bundled fledge plugin the agent can call. Each is a Rust binary speaking the fledge-v1 JSON-lines protocol with a declared dangerous flag.
- spec
- A contract under specs/ (a .spec.md file). fledge spec check gates CI; the agent reads specs as constraints before it writes code.
- verify lane
- fledge lanes run verify: fmt, clippy, tests, and spec-check. The success oracle the agent must pass before it reports done.
- session
- A persisted run, addressable by an 8-hex id. merlin --resume [ID] reopens it; merlin sessions lists them.
Configure once, run anywhere
One config file. Two run paths.
A real fledge.toml with Merlin enabled: three providers, different tiers, spend caps, telemetry off by default. The same config drives the desktop app and the managed CLI.
fledge.toml
[merlin]
verify_before_complete = true
max_retries = 3
persona = "AGENT.md"
rollback_on_verify_exhausted = true
[merlin.spend_caps]
daily_per_provider_usd = 5.0
daily_total_usd = 10.0
[merlin.providers.claude]
tier = "code"
type = "anthropic"
api_key_env = "ANTHROPIC_API_KEY"
model = "claude-sonnet-4-6"
[merlin.providers.ollama]
tier = "code"
type = "openai"
api_key_env = "OLLAMA_API_KEY"
model = "qwen3-coder:480b"
base_url = "https://ollama.com/api/v1"
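To make the [merlin.spend_caps] table concrete, here is a rough sketch of how two daily caps might gate a provider call. It is illustrative only: the SpendCaps class, its method names, and the idea of checking an estimated cost before the call are assumptions; only the two cap values come from the config above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SpendCaps:
    daily_per_provider_usd: float = 5.0   # value from the config above
    daily_total_usd: float = 10.0         # value from the config above
    spent_today: Dict[str, float] = field(default_factory=dict)

    def allows(self, provider: str, estimated_usd: float) -> bool:
        """True if the call fits under both the per-provider and the total cap."""
        provider_total = self.spent_today.get(provider, 0.0) + estimated_usd
        overall_total = sum(self.spent_today.values()) + estimated_usd
        return (
            provider_total <= self.daily_per_provider_usd
            and overall_total <= self.daily_total_usd
        )

    def record(self, provider: str, actual_usd: float) -> None:
        """Add a completed call's cost to today's running total."""
        self.spent_today[provider] = self.spent_today.get(provider, 0.0) + actual_usd
```

With these values, a provider can be refused either because it has used its own 5.00 USD or because all providers together have reached 10.00 USD, whichever comes first.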
CLI flags worth knowing
- --tier read|tool|code
- Pick the cheapest tier-satisfying provider; clamp the tool surface to the tier.
- --provider <name>
- Override the configured default provider for this run.
- --project <path>
- Run as if you had cd'd into <path> first: env and fledge.toml resolve from there.
- --resume [id]
- Resume the most recent session, or a specific 8-hex session id.
- --non-interactive
- No REPL. Auto-deny dangerous tool calls; pair with --allow / --deny globs.
- --output text|json|ndjson
- Human output, a single final JSON result, or a streaming event log.
Bare merlin drops into an interactive REPL. Every flag is real and copy-pasteable.
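The interaction between --non-interactive and the --allow / --deny globs can be sketched as a small decision function. The precedence used here (deny beats allow, and globs match against tool names) is an assumption for illustration; the function name decide and its parameters are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def decide(
    tool: str,
    dangerous: bool,
    allow: Sequence[str] = (),
    deny: Sequence[str] = (),
) -> bool:
    """Return True if a tool call may run with no human available to confirm."""
    if any(fnmatchcase(tool, pattern) for pattern in deny):
        return False  # assumed: an explicit deny always wins
    if not dangerous:
        return True
    # Dangerous tools are auto-denied unless an allow glob names them.
    return any(fnmatchcase(tool, pattern) for pattern in allow)
```

Under this reading, a CI job could pass an allow glob for the few dangerous tools it needs and rely on the default to refuse everything else.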
Current focus
App-first internal alpha
The release gate now moves from "can the dev tool work?" to "can a user install the app, add keys securely, install the CLI from the app, and run Merlin from any folder without knowing about the repo layout?"
- The app installs ~/.local/bin/merlin as a symlink to the bundled CLI and verifies merlin --version.
- MERLIN_RUNTIME_ROOT plus app-bundle runtime detection lets the installed CLI find bundled plugins without a source checkout.
What we're gating now
The checklist a build has to clear before it ships. Linux follows once the macOS loop is boring: an x86_64 AppImage with the same bundled runtime and app-managed CLI behavior.
- Build the macOS release lane from fresh main.
- Open the DMG and verify the icon, Applications shortcut, and app metadata.
- Launch Merlin.app, complete onboarding, and save a provider key through the OS keychain.
- Install ~/.local/bin/merlin from the app and verify PATH guidance is copyable when needed.
- Run merlin --version, merlin doctor, and a provider-backed chat from a folder with no Merlin checkout.
- Review desktop snapshots across panels and themes for contrast, clarity, and product polish.
The dependency chain
Merlin sat at the start of the CorvidLabs 2026 strategy chain. Sub-agent delegation, the original load-bearing move, shipped in v0.3.x. The chain now reads as completed groundwork plus what's next.
- ✓ Done. Sub-agent plugin shipped (`merlin-subagent`); context-retention gap closed
- ✓ Done. Merlin handles real work-tasks end-to-end (autonomous mode, work-tasks plugin)
- ✓ Done. Dropped Claude Code as the daily driver; dogfooding on Merlin since v0.4.0
- → In flight. 1.0 release: sub-agents, long-task durability, the audit chain, and spend caps have landed; what is left is install, onboarding, and reliability polish
- 5 Dogfooding budget frees up for corvid-chat polish
- 6 corvid-chat dogfood + polish; showcase product ready to launch
Defense in depth
Honest security claims.
Defense-in-depth for solo-developer agent workflows. Every claim below is a thing the code actually does, and the guards fire from the plugin side, so they do not depend on the model behaving. See SECURITY.md for the vuln-report channel and the threat model for what we explicitly DON'T protect against.
Destructive SQL gated. sql-run refuses DELETE / DROP / UPDATE / TRUNCATE by default; memory-delete uses a two-phase confirm token.
Project infra is unremovable. files-delete hard-refuses .git / .env* / Cargo.toml / fledge.toml / *.spec.md with no override.
Shell is cwd-clamped. shell-exec clamps the working directory to the project root; cd /elsewhere is refused at the plugin layer, and on-main commits are refused too.
Audit log is tamper-evident. Every destructive-op row is HMAC-chained; merlin audit verify detects tampering and forgery.
Session key in the OS keychain, keyed per project (on-disk fallback for headless or locked-keychain environments). The audit-chain key is HKDF-derived and project-bound.
Network egress is guarded. web-fetch / web-search refuse private IPs through a shared, DNS-rebind-safe SSRF guard (loopback / link-local / multicast blocked).
Secrets are redacted before persistence. Vendor key patterns are scrubbed before storage; merlin redact-history re-applies current rules retroactively.
Spend caps refuse provider calls that would exceed daily USD budgets.
Data-deletion right. merlin wipe --confirm nukes session DB + memory + audit + keychain + cost tracking. Idempotent. Audits the wipe itself.
Dangerous tools need consent. Under --non-interactive, dangerous tools auto-deny; --allow / --deny globs let bridges scope safely.
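As an illustration of the network egress guard in the list above, the sketch below shows one common shape for an SSRF check that also resists DNS rebinding: resolve once, reject any non-public address, and hand back the vetted IPs so the caller connects to those rather than resolving again. The function names is_public and resolve_checked are hypothetical, and this is a generic pattern rather than Merlin's shared guard.

```python
import ipaddress
import socket
from typing import List


def is_public(ip: str) -> bool:
    """False for private, loopback, link-local, multicast, reserved, or unspecified IPs."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def resolve_checked(host: str, port: int = 443) -> List[str]:
    """Resolve once; refuse if any result is non-public; return the IPs to pin."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    ips = sorted({info[4][0] for info in infos})
    if not ips or not all(is_public(ip) for ip in ips):
        raise PermissionError(f"refusing non-public address for {host}")
    return ips
```

The rebinding protection comes from connecting to the returned addresses directly: if the client resolved the hostname a second time at connect time, an attacker-controlled DNS server could answer with a private address after the check had passed.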
Go deeper on Merlin's own site
The live benchmarks, full documentation, and the build blog stay on Merlin's Pages site. The marketing front-door lives here; the depth lives there.