How Factory's Droid CLI routes between 3 models to cut costs 20-25% — with automatic escalation when the cheap model gets stuck.
Despite a 58-model registry, the "auto" router chooses between exactly 3 models — one premium, two cheap. The cost savings come from routing routine tasks away from the expensive model.
Before each LLM request, a gate function decides which model handles it. On the first turn, it runs the full classifier. On subsequent turns, it uses the cached result unless something changes.
Most turns. Reuse the model picked on turn 1. No classifier call, zero latency overhead.
If cached model can't handle images but this turn has images → instant upgrade to opus-4-7.
Org revoked the cached model mid-session → clear cache, run full classifier again.
Classifier times out, crashes, or returns no viable candidates → always falls back to opus-4-7.
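The four outcomes above can be sketched as a single gate function. This is a minimal illustration, not the extracted code: every name here (`gate`, `RouterState`, `TurnContext`, `Classifier`) is hypothetical, the classifier is modelled as a synchronous callback for brevity, and writing the vision upgrade back into the cache is an assumption.

```typescript
// Hypothetical types and names; only the four outcomes come from the text.
type ModelId = string;

interface TurnContext {
  hasImages: boolean;
  allowedModels: Set<ModelId>; // models the org policy currently permits
  visionModels: Set<ModelId>;  // models assumed to accept image input
}

interface RouterState {
  cachedModel?: ModelId;
}

// Returns a model id, or undefined when no candidate is viable.
type Classifier = (ctx: TurnContext) => ModelId | undefined;

const FALLBACK_MODEL: ModelId = "opus-4-7";

function gate(state: RouterState, ctx: TurnContext, classify: Classifier): ModelId {
  // Policy revocation: clear the cache so the full classifier runs again.
  if (state.cachedModel !== undefined && !ctx.allowedModels.has(state.cachedModel)) {
    state.cachedModel = undefined;
  }

  if (state.cachedModel !== undefined) {
    // Vision upgrade: the cached model cannot read images but this turn has some.
    if (ctx.hasImages && !ctx.visionModels.has(state.cachedModel)) {
      state.cachedModel = FALLBACK_MODEL; // caching the upgrade is an assumption
    }
    // Most turns: reuse the cached pick with no classifier call.
    return state.cachedModel;
  }

  // First turn (or cleared cache): run the classifier, falling back on any failure.
  let picked: ModelId | undefined;
  try {
    picked = classify(ctx);
  } catch {
    picked = undefined;
  }
  state.cachedModel = picked ?? FALLBACK_MODEL;
  return state.cachedModel;
}
```

Checking revocation before the cache hit means a revoked model never serves another turn, which matches the "clear cache, run full classifier again" behaviour described above.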
gpt-5.4-mini (0.3x cost) reads the task, scores each candidate 0–1 on predicted first-attempt success rate, then a deterministic selector picks the cheapest model that clears the 0.7 threshold.
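The deterministic selection step can be illustrated as follows. This is a sketch under stated assumptions: the `Candidate` shape, the `selectModel` name, and the cost multipliers in the usage note are hypothetical, and "clears the threshold" is read as greater than or equal to 0.7.

```typescript
// Hypothetical shapes; only the 0.7 threshold and "cheapest wins" come from the text.
interface Candidate {
  id: string;
  costMultiplier: number; // relative cost, lower is cheaper
}

const SUCCESS_THRESHOLD = 0.7;

function selectModel(
  candidates: Candidate[],
  scores: Record<string, number>,
  fallback: string,
): string {
  const viable = candidates
    // Keep candidates whose predicted first-attempt success clears the bar.
    .filter((c) => {
      const s = scores[c.id];
      return typeof s === "number" && s >= SUCCESS_THRESHOLD;
    })
    // Cheapest first.
    .sort((a, b) => a.costMultiplier - b.costMultiplier);

  // No viable candidate: hand the request to the fallback model.
  return viable[0]?.id ?? fallback;
}
```

With the example scores `{"opus": 0.92, "kimi": 0.78, "minimax": 0.71}` and any cost ordering in which `minimax` is cheapest, this returns `minimax`: all three clear 0.7, so price decides.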
<session> context block. Budget: 16K chars max, 2K head + 2K tail.
gpt-5.4-mini scores each candidate. 10s timeout, 2048 max tokens. Returns JSON: {"scores": {"opus": 0.92, "kimi": 0.78, "minimax": 0.71}}
Set effectiveFactoryRouterModel, lock provider via x-api-provider header, cache for subsequent turns.
Each candidate model includes a capability card with real benchmark scores. The classifier pattern-matches the task against these examples.
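The text gives only the budget numbers for the <session> block, not how they combine. One plausible reading, sketched below, is that each long entry is trimmed to its first and last 2K characters and the assembled block is capped at roughly 16K. The function names, the elision marker, and the newest-first packing order are all assumptions.

```typescript
// All names and the packing strategy are hypothetical; only 16K / 2K / 2K come from the text.
const HEAD_CHARS = 2_000;
const TAIL_CHARS = 2_000;
const MAX_BLOCK_CHARS = 16_000;
const ELISION = "\n[...]\n"; // placeholder marker for the cut-out middle

// Keep the first and last 2K characters of an over-long entry.
function trimEntry(text: string): string {
  if (text.length <= HEAD_CHARS + TAIL_CHARS) return text;
  return text.slice(0, HEAD_CHARS) + ELISION + text.slice(-TAIL_CHARS);
}

// Pack entries newest-first until the budget is used up, then restore order.
// The budget counts entry text only, not separators or the wrapper tags.
function buildSessionBlock(entries: string[]): string {
  const kept: string[] = [];
  let used = 0;
  for (let i = entries.length - 1; i >= 0; i--) {
    const trimmed = trimEntry(entries[i] ?? "");
    if (used + trimmed.length > MAX_BLOCK_CHARS) break;
    kept.unshift(trimmed);
    used += trimmed.length;
  }
  return `<session>\n${kept.join("\n")}\n</session>`;
}
```

Keeping both the head and the tail of an entry preserves the opening request and the most recent output, which are usually the most informative parts for a classifier.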
Two mechanisms work together: the model is taught to self-recognize failure patterns, and a background scanner watches for thrashing phrases and injects a nudge.
If ≥ 3 matches in the last 5 assistant messages and UpgradeSessionModel hasn't been called recently, a <system-reminder> is injected telling the model to call the upgrade tool.
The tool description teaches the model: "Call this when you catch yourself saying 'let me try a different approach' on the 2nd attempt. This is not admitting defeat — it's correct resource allocation."
Background scanner (signals.ts) counts phrases → injects a <system-reminder> nudge → model calls UpgradeSessionModel({}) → one-way upgrade.
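The scanner rule (at least 3 matches in the last 5 assistant messages, and no recent upgrade call) could look roughly like this. It is a sketch, not the contents of signals.ts: the phrase list is a placeholder with one phrase borrowed from the tool description, "recently" is passed in as a boolean because its definition is not given, and each phrase counts at most once per message.

```typescript
// Hypothetical message shape and helper names.
interface Message {
  role: "user" | "assistant" | "tool";
  text: string;
}

// Placeholder list; the real phrases are not reproduced here.
const STUCK_PHRASES = ["let me try a different approach"];
const WINDOW = 5;    // last N assistant messages
const THRESHOLD = 3; // matches needed to trigger the nudge

function countStuckMatches(messages: Message[]): number {
  const recent = messages.filter((m) => m.role === "assistant").slice(-WINDOW);
  let matches = 0;
  for (const m of recent) {
    const lower = m.text.toLowerCase();
    for (const phrase of STUCK_PHRASES) {
      if (lower.includes(phrase)) matches++; // once per phrase per message (assumption)
    }
  }
  return matches;
}

function shouldNudge(messages: Message[], upgradeCalledRecently: boolean): boolean {
  return !upgradeCalledRecently && countStuckMatches(messages) >= THRESHOLD;
}
```

When `shouldNudge` returns true, the caller would inject the <system-reminder> described above; the model still decides whether to call UpgradeSessionModel.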
When there's no upgrade path (model already at max tier, or target blocked by policy), UpgradeSessionModel is removed from the tool list entirely. The LLM literally cannot see it.
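Hiding the tool can be expressed as a filter over the tool list before each request. The sketch below assumes a hypothetical `Tool` shape and takes the upgrade target as an argument, with `undefined` standing for "already at max tier"; how the real code resolves the target is not shown in the source.

```typescript
// Hypothetical shape; only the tool name comes from the text.
interface Tool {
  name: string;
}

const UPGRADE_TOOL = "UpgradeSessionModel";

function visibleTools(
  tools: Tool[],
  upgradeTarget: string | undefined, // undefined: no higher tier exists
  allowedModels: Set<string>,        // models permitted by org policy
): Tool[] {
  const canUpgrade = upgradeTarget !== undefined && allowedModels.has(upgradeTarget);
  // With no usable upgrade path, the tool never reaches the model's tool list.
  return canUpgrade ? tools : tools.filter((t) => t.name !== UPGRADE_TOOL);
}
```

Removing the tool, rather than having it return an error, avoids wasted turns in which the model calls an upgrade that cannot succeed.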
9 of 11 upgrade paths converge on claude-opus-4-7. There is no downgrade. Cross-provider switches trigger conversation compaction.
Session model stays "auto". Only effectiveFactoryRouterModel changes. No compaction if same provider family. Cached for all future turns.
Session model permanently changes. Cross-provider switches (e.g. factory → anthropic) trigger conversation serialization and compaction. Provider lock updates.
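The two upgrade behaviours can be contrasted in one small state transition. This sketch assumes the first case applies when the session model is "auto" and the second when an explicit model is selected; that mapping, the `SessionState` shape, and updating the provider lock in both branches are assumptions, not extracted behaviour.

```typescript
// Hypothetical state shape; field names other than effectiveFactoryRouterModel are invented.
interface SessionState {
  sessionModel: string;                 // "auto" or an explicit model id
  effectiveFactoryRouterModel?: string; // router's current pick in auto mode
  lockedProvider: string;               // e.g. "factory" or "anthropic"
}

interface UpgradeTarget {
  model: string;
  provider: string;
}

function applyUpgrade(
  state: SessionState,
  target: UpgradeTarget,
): { next: SessionState; compact: boolean } {
  // Compaction is only needed when the conversation moves to another provider.
  const compact = target.provider !== state.lockedProvider;

  const next: SessionState =
    state.sessionModel === "auto"
      // Auto mode: the session model stays "auto"; only the router's pick changes.
      ? { ...state, effectiveFactoryRouterModel: target.model, lockedProvider: target.provider }
      // Explicit model: the session model itself changes permanently.
      : { ...state, sessionModel: target.model, lockedProvider: target.provider };

  return { next, compact };
}
```

There is deliberately no inverse operation, reflecting the one-way nature of upgrades described above.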
<system-reminder> escalation hint injected.
UpgradeSessionModel({}). Cross-provider switch triggers conversation compaction. All subsequent turns use opus.
The droid binary is a Bun single-executable: the JS application is embedded as literal UTF-8 text. Minified but not obfuscated. Variable names are mangled, but all string constants survive.
Survives: string literals, enum values, error messages, log statements, metric names, JSON schemas, prompt text, file paths.
Lost: variable names (mangled to T, R, H...), type annotations, comments, formatting. Sourcemaps have empty "names":[] arrays.
Inferable: function purposes (from log messages), data flow (from error strings), architecture (from sourcemap paths + metric names).
These are the exact prompt strings extracted from the binary.
Variable: frH · Binary offset: ~69676500 · The model reads this tool description to know when to self-escalate.
Variable: GBh · Binary offset: ~64768000 · Injected as a <system-reminder> when stuck detection triggers (≥3 stuck phrases in last 5 assistant messages).
These cards are fed verbatim to the classifier LLM (gpt-5.4-mini) to inform routing decisions. They contain real evaluation scores from Factory's benchmarks.
(DO0)(oO0)(qO0)