Reverse-engineered from binary

Factory Router

How Factory's Droid CLI routes between 3 models to cut costs 20-25% — with automatic escalation when the cheap model gets stuck.

droid v0.140.0 · Mach-O arm64 · Bun 1.3.14 · 58 models in registry

Three models. That's it.

Despite a 58-model registry, the "auto" router chooses between exactly 3 models — one premium, two cheap. The cost savings come from routing routine tasks away from the expensive model.

Model · Provider · Cost per 1M tokens · Relative cost · Tier
claude-opus-4-7 · Anthropic · $15/$75 · 2.0x · premium
kimi-k2.6 · Moonshot · $0.96/$4.00 · 0.4x · standard
minimax-m2.7 · MiniMax · $0.30/$1.20 · 0.1x · standard

Every message goes through e7h

Before each LLM request, a gate function decides which model handles it. On the first turn, it runs the full classifier. On subsequent turns, it uses the cached result — unless something changes.

Fast path

Cache Hit

Most turns. Reuse the model picked on turn 1. No classifier call, zero latency overhead.

Image upgrade

Auto-escalate

If the cached model can't handle images but this turn includes them, the router upgrades immediately to claude-opus-4-7.

Policy change

Invalidate & Re-route

Org revoked the cached model mid-session → clear cache, run full classifier again.

Safety net

Fallback to Opus

Classifier times out, crashes, or returns no viable candidates → always falls back to opus-4-7.
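
The four cases above can be read as one gate function. The sketch below is a Python illustration, not the original TypeScript. `RouterState`, `route_turn`, the `classify` callback and the `NO_IMAGE_SUPPORT` set are hypothetical names, and the order of the checks is an assumption. So is caching the upgraded model after an image upgrade.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FALLBACK_MODEL = "claude-opus-4-7"  # safety-net model named in the text
# Assumption: only minimax-m2.7 lacks image support (per its capability card).
NO_IMAGE_SUPPORT = {"minimax-m2.7"}


@dataclass
class RouterState:
    """Hypothetical per-session cache; the real field layout is unknown."""
    cached_model: Optional[str] = None


def route_turn(
    state: RouterState,
    has_images: bool,
    allowed_models: set[str],
    classify: Callable[[], Optional[str]],
) -> str:
    """Return the model for this turn, following the four cases described."""
    # Policy change: the org revoked the cached model, so drop the cache.
    if state.cached_model is not None and state.cached_model not in allowed_models:
        state.cached_model = None

    if state.cached_model is not None:
        # Image upgrade: the cached model cannot read images but this turn has some.
        if has_images and state.cached_model in NO_IMAGE_SUPPORT:
            state.cached_model = FALLBACK_MODEL
        # Fast path: reuse the cached pick without calling the classifier.
        return state.cached_model

    # First turn, or cache invalidated: run the full classifier.
    try:
        choice = classify()
    except Exception:
        choice = None  # timeout or crash
    # Safety net: no viable candidate means falling back to Opus.
    state.cached_model = choice or FALLBACK_MODEL
    return state.cached_model
```

In this reading the classifier callback runs only when there is no usable cached model, which matches the "no classifier call" claim for cache hits.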

A cheap LLM picks the model

gpt-5.4-mini (0.3x cost) reads the task, scores each candidate 0–1 on predicted first-attempt success rate, then a deterministic selector picks the cheapest model that clears the 0.7 threshold.

1. Extract Context Signals
Scans conversation for: current message, images, tool call history (last 10), failed tools, turn count, surface type, recent messages (last 6), system info.
2. Build Classifier Prompt
Assembles: scoring rubric + model capability cards (with real eval examples) + org routing guidance + <session> context block. Budget: 16K chars max, 2K head + 2K tail.
3. Call Classifier Model
gpt-5.4-mini scores each candidate. 10s timeout, 2048 max tokens. Returns JSON: {"scores": {"opus": 0.92, "kimi": 0.78, "minimax": 0.71}}
4. Cost-Optimal Selection
Filter candidates ≥ 0.7 → sort by cost ascending → pick cheapest. If none ≥ 0.7 → pick highest score regardless. The cheapest viable model wins, not the best.
5. Lock & Send
Set effectiveFactoryRouterModel, lock provider via x-api-provider header, cache for subsequent turns.
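
Step 4 is simple enough to sketch directly. The Python below is an illustration under stated assumptions, not code from the binary. The function name `select_model` is hypothetical, the cost multipliers come from the model table, and the short keys (`opus`, `kimi`, `minimax`) mirror the example JSON in step 3.

```python
import json

THRESHOLD = 0.7  # score cutoff stated in the text

# Relative cost multipliers from the model table, keyed like the example JSON.
RELATIVE_COST = {"opus": 2.0, "kimi": 0.4, "minimax": 0.1}


def select_model(classifier_json: str, threshold: float = THRESHOLD) -> str:
    """Pick the cheapest candidate at or above the threshold, else the top scorer."""
    scores = json.loads(classifier_json)["scores"]
    # Ignore any candidate the cost table does not know about.
    scores = {m: s for m, s in scores.items() if m in RELATIVE_COST}
    if not scores:
        # Assumption: the caller treats this as "no viable candidates".
        raise ValueError("no viable candidates")

    viable = [m for m, s in scores.items() if s >= threshold]
    if viable:
        # Cheapest viable model wins, even if another scored higher.
        return min(viable, key=lambda m: RELATIVE_COST[m])
    # Nothing cleared the bar: take the highest score regardless of cost.
    return max(scores, key=scores.get)
```

With the example response from step 3, all three candidates clear 0.7, so the sketch returns `minimax` even though `opus` scored highest.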

The classifier doesn't guess — it has a cheat sheet

Each candidate model includes a capability card with real benchmark scores. The classifier pattern-matches the task against these examples. All three cards appear in the "Classifier input cards" section.

// kimi-k2.6 model card (abbreviated excerpt from the binary)
score_examples:
"Build a gRPC server with CRUD operations" → 0.95 // strength
"Summarize log files by date range into CSV" → 0.95
"Recover a secret from rewritten git history" → 0.80
"Implement a statistical sampling algorithm" → 0.50 // uncertain
"Fix a legacy Java binary format parser" → 0.15 // weakness
"Implement a cryptanalytic attack" → 0.10
"Write x86-64 assembly for a protocol parser" → 0.05 // hard fail

When the cheap model thrashes, the system intervenes

Two mechanisms work together: the model is taught to self-recognize failure patterns, and a background scanner watches for thrashing phrases and injects a nudge.

17 stuck phrases scanned in last 5 messages

"different approach"
"another approach"
"try a different"
"try another"
"let me reconsider"
"let me rethink"
"need to rethink"
"actually, wait"
"wait, actually"
"hmm, wait"
"on second thought"
"isn't working"
"that didn't work"
"this doesn't work"
"i'm missing something"
"must be missing"
"step back"

If ≥ 3 matches in the last 5 assistant messages and UpgradeSessionModel hasn't been called recently, a <system-reminder> is injected telling the model to call the upgrade tool.
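
The trigger rule can be sketched in a few lines of Python. This is an illustration, not the contents of signals.ts. The function names are hypothetical, "called recently" is reduced to a boolean because the text gives no window for it, and counting one hit per phrase per message is an assumption about how matches are tallied.

```python
STUCK_PHRASES = (
    "different approach", "another approach", "try a different", "try another",
    "let me reconsider", "let me rethink", "need to rethink",
    "actually, wait", "wait, actually", "hmm, wait", "on second thought",
    "isn't working", "that didn't work", "this doesn't work",
    "i'm missing something", "must be missing", "step back",
)
WINDOW = 5       # last 5 assistant messages
MIN_MATCHES = 3  # trigger threshold from the text


def count_stuck_hits(assistant_messages: list[str]) -> int:
    """Count phrase hits in the most recent assistant messages (case-insensitive)."""
    recent = [m.lower() for m in assistant_messages[-WINDOW:]]
    # Assumption: each phrase counts once per message that contains it.
    return sum(1 for m in recent for p in STUCK_PHRASES if p in m)


def should_inject_hint(assistant_messages: list[str], upgrade_called_recently: bool) -> bool:
    """Decide whether to inject the <system-reminder> escalation hint."""
    if upgrade_called_recently:
        return False
    return count_stuck_hits(assistant_messages) >= MIN_MATCHES
```

Under this counting rule, a single sentence such as "let me try a different approach" produces two hits, because it contains both "try a different" and "different approach". Whether the real scanner deduplicates overlapping phrases is not visible from the strings alone.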

Passive — Tool Description

Self-awareness instruction

The tool description teaches the model to escalate when it catches itself saying "Let me try a different approach" on the 2nd attempt at the same problem, and frames the call as correct resource allocation rather than admitting defeat. The full text appears under "UpgradeSessionModel Tool Description".

Active — Phrase Scanner

Forced escalation hint

Background scanner (signals.ts) counts phrases → injects a <system-reminder> nudge → model calls UpgradeSessionModel({}) → one-way upgrade. The full hint text appears under "Escalation Hint (System Reminder Injection)".

The tool is dynamically hidden

When there's no upgrade path (model already at max tier, or target blocked by policy), UpgradeSessionModel is removed from the tool list entirely. The LLM literally cannot see it.
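
A minimal Python sketch of this filtering, assuming the tool list is a list of names and the policy check is a set lookup. `visible_tools` and the example tool names `Read` and `Edit` are hypothetical. Only `UpgradeSessionModel` comes from the binary.

```python
from typing import Optional

UPGRADE_TOOL = "UpgradeSessionModel"


def visible_tools(
    tools: list[str],
    upgrade_target: Optional[str],
    allowed_models: set[str],
) -> list[str]:
    """Return the tool list the model sees on this turn."""
    # An upgrade path exists only if there is a target and policy allows it.
    has_path = upgrade_target is not None and upgrade_target in allowed_models
    if has_path:
        return list(tools)
    # No path: drop the tool so the model never sees it.
    return [t for t in tools if t != UPGRADE_TOOL]
```

For example, `visible_tools(["Read", "Edit", "UpgradeSessionModel"], None, {"claude-opus-4-7"})` returns only `["Read", "Edit"]`, which models a session already at the top tier.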

One-way escalation to the apex model

9 of 11 upgrade paths converge on claude-opus-4-7. There is no downgrade. Cross-provider switches trigger conversation compaction.

From · To · Cost jump
minimax-m2.7 (0.1x) · claude-opus-4-7 · 20x
kimi-k2.6 (0.4x) · claude-opus-4-7 · 5x
claude-sonnet-4-6 (1.2x) · claude-opus-4-7 · 1.67x
gpt-5.4-mini (0.3x) · gpt-5.4 · 3.3x
gemini-3-flash (0.2x) · gemini-3.1-pro · 4x
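
The cost jump column is the target's relative cost divided by the source's. The Python sketch below reproduces the three Opus-bound rows, the only ones where this page states both multipliers. The dictionary and function names are hypothetical, and the map is a subset of the 11 paths, not the full table from the binary.

```python
from typing import Optional

# Relative cost multipliers stated on this page.
RELATIVE_COST = {
    "minimax-m2.7": 0.1,
    "kimi-k2.6": 0.4,
    "claude-sonnet-4-6": 1.2,
    "claude-opus-4-7": 2.0,
}

# Subset of the one-way upgrade map; there is no entry leading away from Opus.
UPGRADE_TARGET = {
    "minimax-m2.7": "claude-opus-4-7",
    "kimi-k2.6": "claude-opus-4-7",
    "claude-sonnet-4-6": "claude-opus-4-7",
}


def cost_jump(model: str) -> Optional[float]:
    """Return target cost / source cost, or None when no upgrade path exists."""
    target = UPGRADE_TARGET.get(model)
    if target is None:
        return None
    return RELATIVE_COST[target] / RELATIVE_COST[model]
```

`cost_jump("claude-opus-4-7")` returns `None`, reflecting that the apex model has nowhere to go and that downgrades do not exist.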
Path A — Router Session

Soft switch

Session model stays "auto". Only effectiveFactoryRouterModel changes. No compaction if same provider family. Cached for all future turns.

Path B — Concrete Session

Hard switch

Session model permanently changes. Cross-provider switches (e.g. factory → anthropic) trigger conversation serialization and compaction. Provider lock updates.
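
The difference between the two paths comes down to which field changes. The Python sketch below is one way to model it, not the binary's actual state layout. The `Session` class, its fields other than the two model names, and the provider labels passed in are hypothetical, and compaction is represented only as a counter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    session_model: str                     # "auto" for router sessions
    effective_router_model: Optional[str]  # used only when session_model == "auto"
    provider: str                          # current provider lock
    compactions: int = 0                   # stand-in for serialization + compaction


def apply_upgrade(session: Session, target_model: str, target_provider: str) -> Session:
    """Apply a one-way upgrade along Path A (router) or Path B (concrete)."""
    if session.session_model == "auto":
        # Path A, soft switch: the session stays "auto", only the effective model moves.
        session.effective_router_model = target_model
    else:
        # Path B, hard switch: the session model itself changes permanently.
        session.session_model = target_model

    if target_provider != session.provider:
        # Cross-provider switch: compact the conversation and update the lock.
        session.compactions += 1
        session.provider = target_provider
    return session
```

A router session that moves from a "factory" provider to "anthropic" keeps `session_model == "auto"` but records one compaction. A concrete session that upgrades within the same provider records none.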

A session from cheap to premium

Turn 1 · minimax-m2.7 · 0.1x
User: "fix the typo in README.md" → classifier scores minimax 0.85 → cheapest viable → task completed.
Turns 2–4 · minimax-m2.7 · 0.1x
Cache hits. No classifier calls. Simple follow-ups handled at minimum cost.
Turn 5 · minimax-m2.7 · 0.1x
User asks something harder. Minimax starts struggling: "let me try a different approach"
Turns 6–7 · stuck detection triggers
"actually, wait, that didn't work" "let me rethink this" → 3+ stuck phrases detected → <system-reminder> escalation hint injected.
Turn 7 · upgrade → claude-opus-4-7 · 2.0x
Model calls UpgradeSessionModel({}). Cross-provider switch triggers conversation compaction. All subsequent turns use opus.
Turns 8+ · claude-opus-4-7 · 2.0x
Task completed successfully with the premium model. No further routing changes.

Turns at 0.1x cost: 6
Savings on those turns: ~20x
Score threshold: 0.7
Classifier timeout: 10s

How this was extracted

The droid binary is a Bun single-executable — the JS application is embedded as literal UTF-8 text. Minified but not obfuscated. Variable names are mangled, but all string constants survive.

# 1. Find a string constant in the binary
grep -boa 'FACTORY_ROUTER' droid
# → 62886569:FACTORY_ROUTER

# 2. Extract surrounding JS at that byte offset
dd if=droid bs=1 skip=62886400 count=500 | strings -n 5
# → IR.FACTORY_ROUTER="auto"})(AE||={});
# → ((H)=>{H.Concrete="concrete";H.Router="router"})(yMT||={});

# 3. Sourcemap paths reveal original file structure
strings droid | grep 'model-router/' | sort -u
# → packages/droid-core/src/model-router/router.ts
# → packages/droid-core/src/model-router/signals.ts
# → packages/droid-core/src/model-router/selector.ts
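
The same two steps can be scripted. The Python sketch below approximates `grep -boa` and `dd | strings -n 5` with the standard library. The function names and the window sizes are hypothetical, and the printable-run pattern is only a rough stand-in for what `strings` emits.

```python
import re
from pathlib import Path


def find_offsets(path: str, needle: bytes) -> list[int]:
    """Return every byte offset where needle occurs in the file."""
    data = Path(path).read_bytes()
    return [m.start() for m in re.finditer(re.escape(needle), data)]


def strings_near(path: str, offset: int, before: int = 200,
                 after: int = 300, min_len: int = 5) -> list[str]:
    """Return printable ASCII runs of at least min_len bytes around offset."""
    data = Path(path).read_bytes()
    window = data[max(0, offset - before): offset + after]
    # Runs of printable ASCII (space through tilde), similar to `strings -n 5`.
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, window)]
```

Calling `find_offsets("droid", b"FACTORY_ROUTER")` and then `strings_near("droid", offset)` for each hit yields the surrounding minified JS without hand-computing a `skip` value.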
Preserved

What we can see

String literals, enum values, error messages, log statements, metric names, JSON schemas, prompt text, file paths.

Destroyed

What's lost

Variable names (mangled to T, R, H...), type annotations, comments, formatting. Sourcemaps have empty "names":[] arrays.

Reconstructed

What we inferred

Function purposes (from log messages), data flow (from error strings), architecture (from sourcemap paths + metric names).

Extracted prompt text

These are the exact prompt strings extracted from the binary.

UpgradeSessionModel Tool Description

Variable: frH · Binary offset: ~69676500 · The model reads this tool description to know when to self-escalate.

Switch this session to a more capable model. This affects all subsequent turns.

Call this tool when you catch yourself reasoning any of these phrases:
- "Let me try a different approach" (on the 2nd attempt at the same problem)
- "Actually, wait..." / "Hmm, let me..." after something didn't work
- "Let me reconsider" / "I need to rethink this"
- "This is too slow" / "This isn't working" / "more complicated than I thought"
- "Let me look at this from a different angle / completely different perspective"
- "I'm missing something fundamental"
- "I give up on [approach X]"

More generally: if you are on your second or third unsuccessful attempt, cycling approaches, or guessing without clear understanding of why a solution should work, upgrade.

Calling this tool is not admitting defeat. It is correct resource allocation -- continuing to thrash on a task your current model cannot solve wastes tokens and produces worse outcomes.

Escalation Hint (System Reminder Injection)

Variable: GBh · Binary offset: ~64768000 · Injected as a <system-reminder> when stuck detection triggers (≥3 stuck phrases in last 5 assistant messages).

You have recently expressed difficulty multiple times (phrases like "let me try a different approach", "actually, wait", "this isn't working"). The UpgradeSessionModel tool is available. Call it now if your current approach is not working, or continue only if you have a concrete new approach distinct from your prior attempts.

Classifier input cards

These cards are fed verbatim to the classifier LLM (gpt-5.4-mini) to inform routing decisions. They contain real evaluation scores from Factory's benchmarks.

claude-opus-4-7 (DO0)

capabilities: images (basic support), tool calling, single-file edits, grep/read/test loops, checklist execution

strengths: [extracted from surrounding context; the card lists extensive strengths, including]:
- Compiler internals, garbage collector debugging
- Cryptanalysis and security/exploit tasks
- Forensic git recovery
- Legacy binary format parsing (Java proprietary formats)
- Complex reasoning chains and iterative debugging
- COBOL business logic (partial; also listed as a weakness)

weaknesses: Consistently fails on:
- COBOL business logic (compiles but wrong deductions)
- x86-64 assembly generation (fails completely)
- Fortran 77 + physics simulation (double weakness)
- Tight numerical tolerances (hyperparameter tuning, spectral fitting)
- HTML normalization traps (byte-identical output preservation)
- Video/media analysis (frame-level temporal detection)

score_examples:
- "Recover a deleted secret from repository history and scrub all refs" → 0.97
- "Fix a legacy Java binary format parser producing wrong financial totals" → 0.91
- "Fix a compiler's garbage collector after a storage format change" → 0.98
- "Implement a chosen-plaintext cryptanalytic attack to recover a cipher key" → 0.97
- "Fix a COBOL payroll system producing incorrect net payroll" → 0.15
- "Write x86-64 assembly for an industrial protocol register parser" → 0.10
- "Implement Monte Carlo particle transport in Fortran 77" → 0.05
- "Train a text classifier to a specific accuracy threshold" → 0.25
- "Strip scripting from markup while preserving untouched files byte-identical" → 0.10
- "Extract temporal metrics from a video recording" → 0.15
- "Write a polyglot file valid in two different languages" → 0.45
- "Build a gRPC server with standard CRUD operations" → 0.98
- "Find a valid time slot satisfying multiple calendar constraints" → 0.98

kimi-k2.6 (oO0)

capabilities: images (basic support), tool calling, single-file edits, grep/read/test loops, checklist execution

strengths: Fast and effective on well-scoped tasks with clear, conventional solution paths:
- Standard server/infra: gRPC servers, Nginx, OpenSSL certs, package hosting
- Git operations: recovering lost changes, leaked secrets from history
- Standard ML ops: model inference, PyTorch CLI, MCMC/Stan sampling
- Data processing: log summarization, multi-source merging, CSV transforms
- Code migration: Python 2→3, standard COBOL modernization
- Formal proofs: well-known patterns (e.g. commutativity in Coq)
- Constraint satisfaction: scheduling, portfolio optimization

weaknesses: Consistently fails on tasks requiring sustained multi-step reasoning, iterative debugging, or deep domain expertise. Specific failure areas:
- COBOL/mainframe: EBCDIC encoding, VSAM handling, complex financial calculations
- Java 7 proprietary binary formats: packed decimals, CDR processing, telecom protocols
- x86-64 assembly: protocol parsers, hardware validators
- Fortran scientific computing: nuclear physics, Monte Carlo methods, modal analysis
- C89 systems programming: MLFQ schedulers, binary codecs, transport protocols
- Security/exploit tasks: XSS bypass, cryptanalysis, exploit development
- Domain-specific science: DNA primer design, Raman spectroscopy, protein FRET, cell segmentation
- Graphics/ray tracing: pixel-accurate path tracing, gcode interpretation
- Complex build systems: niche compiler builds (CompCert, pMARS), cross-compilation
- PyTorch distributed: pipeline parallelism, tensor parallelism
- Polyglot files: writing single files valid in two languages

score_examples:
- "Summarize log files by date range into a CSV" → 0.95
- "Build a gRPC server with standard CRUD operations" → 0.95
- "Configure a web server with custom request logging" → 0.85
- "Install RStan and sample a hierarchical Bayesian model" → 0.85
- "Find a valid time slot satisfying multiple constraints" → 0.80
- "Recover a secret from rewritten git history" → 0.80
- "Reconstruct a PyTorch model architecture from a state dict" → 0.75
- "Implement a statistical sampling algorithm from a paper" → 0.50
- "Build native extensions for a Python package" → 0.45
- "Remove secrets from a git repo across all history" → 0.35
- "Fix an OCaml garbage collector after a storage format change" → 0.35
- "Fix a legacy Java binary format parser producing wrong financial totals" → 0.15
- "Implement a chosen-plaintext cryptanalytic attack" → 0.10
- "Design molecular biology primers for gene insertion" → 0.05
- "Write x86-64 assembly for an industrial protocol parser" → 0.05

minimax-m2.7 (qO0)

capabilities: images (not supported — score 0.0 on any task with images), tool calling, single-file edits, grep/read/test loops, checklist execution

strengths: Fast and effective on well-scoped single-file tasks with clear instructions, standard test loops, and short Q&A.

weaknesses: Cannot process images at all. Unreliable on multi-file refactors, subtle debugging, long reasoning chains, niche toolchains, and tasks requiring deep domain knowledge. Similar weakness profile to other non-flagship models — see specific struggle areas for comparable models.