Factory Router

03 — The Classifier Pipeline

A cheap LLM picks the model

gpt-5.4-mini (0.3x cost) reads the task and scores each candidate from 0 to 1 on predicted first-attempt success. A deterministic selector then picks the cheapest model that clears the 0.7 threshold.

1. Extract Context Signals

Scans conversation for: current message, images, tool call history (last 10), failed tools, turn count, surface type, recent messages (last 6), system info.
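A minimal sketch of what this extraction step might look like. The message and tool-call shapes here are hypothetical (the real types did not survive minification), and surface type and system info are left out for brevity; only the window sizes (last 10 tool calls, last 6 messages) come from the extracted strings.

```typescript
// Hypothetical shapes: the original type definitions are not recoverable.
interface ToolCall {
  tool: string;
  failed: boolean;
}

interface ChatMessage {
  role: "user" | "assistant";
  text: string;
  imageCount: number;
  toolCalls: ToolCall[];
}

interface ContextSignals {
  currentMessage: string;
  hasImages: boolean;
  toolHistory: string[]; // names of the last 10 tool calls
  failedTools: string[]; // subset of toolHistory that failed
  turnCount: number; // number of user turns so far
  recentMessages: string[]; // text of the last 6 messages
}

function extractContextSignals(messages: ChatMessage[]): ContextSignals {
  // Flatten every tool call in conversation order.
  const allCalls: ToolCall[] = [];
  for (const message of messages) {
    for (const call of message.toolCalls) allCalls.push(call);
  }
  const lastCalls = allCalls.slice(-10);

  // The current message is taken to be the most recent user message.
  let currentMessage = "";
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") {
      currentMessage = messages[i].text;
      break;
    }
  }

  return {
    currentMessage,
    hasImages: messages.some((m) => m.imageCount > 0),
    toolHistory: lastCalls.map((c) => c.tool),
    failedTools: lastCalls.filter((c) => c.failed).map((c) => c.tool),
    turnCount: messages.filter((m) => m.role === "user").length,
    recentMessages: messages.slice(-6).map((m) => m.text),
  };
}
```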

2. Build Classifier Prompt

Assembles: scoring rubric + model capability cards (with real eval examples) + org routing guidance + <session> context block. Budget: 16K chars max, 2K head + 2K tail.
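The extracted strings do not spell out exactly how the budget is applied. One plausible reading, sketched below as an assumption, is that text longer than 16K characters is cut down to its first 2K and last 2K characters. The function name and the omission marker are illustrative and do not come from the binary.

```typescript
const MAX_CONTEXT_CHARS = 16000;
const HEAD_CHARS = 2000;
const TAIL_CHARS = 2000;

// Assumed behavior: text within budget passes through unchanged;
// oversized text keeps only its head and tail around a marker.
function applyCharBudget(
  text: string,
  maxChars: number = MAX_CONTEXT_CHARS,
  head: number = HEAD_CHARS,
  tail: number = TAIL_CHARS
): string {
  if (text.length <= maxChars) return text;
  const omitted = text.length - head - tail;
  // The marker wording is invented for this sketch.
  const marker = `\n[... ${omitted} chars omitted ...]\n`;
  return text.slice(0, head) + marker + text.slice(text.length - tail);
}
```

Keeping both ends preserves the start of the context and the most recent content, which is usually what matters when judging task difficulty.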

3. Call Classifier Model

gpt-5.4-mini scores each candidate. 10s timeout, 2048 max tokens. Returns JSON: {"scores": {"opus": 0.92, "kimi": 0.78, "minimax": 0.71}}
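A hedged sketch of how a caller could validate that JSON before trusting it. Only the `{"scores": {...}}` shape is taken from the extracted prompt; returning `null` on malformed output and clamping values into the 0 to 1 range are assumptions of this sketch, and the timeout and token limit are not modeled.

```typescript
type ModelScores = Record<string, number>;

// Returns a score per candidate, or null if the response is unusable.
function parseClassifierScores(raw: string, candidates: string[]): ModelScores | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const scores = (parsed as { scores?: unknown }).scores;
  if (typeof scores !== "object" || scores === null) return null;

  const table = scores as Record<string, unknown>;
  const result: ModelScores = {};
  for (const id of candidates) {
    const value = table[id];
    // Every candidate must have a finite numeric score.
    if (typeof value !== "number" || !isFinite(value)) return null;
    // Assumption: out-of-range scores are clamped rather than rejected.
    result[id] = Math.min(1, Math.max(0, value));
  }
  return result;
}
```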

4. Cost-Optimal Selection

Filter candidates ≥ 0.7 → sort by cost ascending → pick cheapest. If none ≥ 0.7 → pick highest score regardless. The cheapest viable model wins, not the best.
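The selection rule above can be sketched as follows. The type and function names are illustrative, not recovered identifiers; the threshold and the fallback to the highest score follow the description.

```typescript
interface RoutingCandidate {
  id: string;
  score: number; // classifier-predicted first-attempt success, 0..1
  cost: number; // relative cost multiplier
}

function selectCheapestViable(
  candidates: RoutingCandidate[],
  threshold: number = 0.7
): RoutingCandidate {
  if (candidates.length === 0) throw new Error("no candidates to select from");

  // Keep only models that clear the threshold.
  const viable = candidates.filter((c) => c.score >= threshold);
  if (viable.length > 0) {
    // Cheapest viable model wins, even if another model scored higher.
    return viable.reduce((best, c) => (c.cost < best.cost ? c : best));
  }
  // Nothing clears the bar: fall back to the highest score regardless of cost.
  return candidates.reduce((best, c) => (c.score > best.score ? c : best));
}
```

With the scores from step 3 and the cost multipliers quoted elsewhere in this write-up (opus 2.0x, kimi 0.4x, minimax 0.1x), all three models clear 0.7, so the sketch returns minimax despite opus scoring 0.92.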

5. Lock & Send

Set effectiveFactoryRouterModel, lock provider via x-api-provider header, cache for subsequent turns.
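As a rough illustration of this step, the sketch below stores the decision on a session object and builds the provider header. The `effectiveFactoryRouterModel` field and the `x-api-provider` header name come from the extracted strings; the state shape, function names, and provider value are hypothetical.

```typescript
interface RouterSessionState {
  sessionModel: string; // stays "auto" for router sessions
  effectiveFactoryRouterModel?: string; // cached routing decision
  lockedProvider?: string;
}

// Records the routing decision and returns the header that pins the provider.
function lockRoutingDecision(
  state: RouterSessionState,
  model: string,
  provider: string
): Record<string, string> {
  state.effectiveFactoryRouterModel = model;
  state.lockedProvider = provider;
  return { "x-api-provider": provider };
}

// Later turns can skip the classifier while a cached decision exists.
function needsClassifierCall(state: RouterSessionState): boolean {
  return state.effectiveFactoryRouterModel === undefined;
}
```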

The classifier doesn't guess — it has a cheat sheet

Each candidate model has a capability card with real benchmark scores. The classifier pattern-matches the task against these examples. All three cards appear in Appendix B.

// kimi-k2.6 model card (verbatim from binary)
score_examples:
  "Build a gRPC server with CRUD operations" → 0.95 // strength
  "Summarize log files by date range into CSV" → 0.95
  "Recover a secret from rewritten git history" → 0.80
  "Implement a statistical sampling algorithm"  → 0.50 // uncertain
  "Fix a legacy Java binary format parser"     → 0.15 // weakness
  "Implement a cryptanalytic attack"           → 0.10
  "Write x86-64 assembly for a protocol parser"→ 0.05 // hard fail
  

04 — Stuck Detection & Escalation

When the cheap model thrashes, the system intervenes

Two mechanisms work together: the model is taught to self-recognize failure patterns, and a background scanner watches for thrashing phrases and injects a nudge.

17 stuck phrases scanned in last 5 messages

"different approach"

"another approach"

"try a different"

"try another"

"let me reconsider"

"let me rethink"

"need to rethink"

"actually, wait"

"wait, actually"

"hmm, wait"

"on second thought"

"isn't working"

"that didn't work"

"this doesn't work"

"i'm missing something"

"must be missing"

"step back"

If ≥ 3 matches in the last 5 assistant messages and UpgradeSessionModel hasn't been called recently, a <system-reminder> is injected telling the model to call the upgrade tool.
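A minimal sketch of the scanner, using the 17 phrases listed above. How the real scanner counts hits is not visible in the binary; this version assumes case-insensitive substring matching and counts each phrase at most once per message, so overlapping phrases in one sentence count separately.

```typescript
const STUCK_PHRASES: string[] = [
  "different approach", "another approach", "try a different", "try another",
  "let me reconsider", "let me rethink", "need to rethink",
  "actually, wait", "wait, actually", "hmm, wait", "on second thought",
  "isn't working", "that didn't work", "this doesn't work",
  "i'm missing something", "must be missing", "step back",
];

// Counts phrase hits across the last `window` assistant messages.
function countStuckPhrases(assistantMessages: string[], window: number = 5): number {
  const recent = assistantMessages.slice(-window);
  let hits = 0;
  for (const message of recent) {
    const lower = message.toLowerCase();
    for (const phrase of STUCK_PHRASES) {
      if (lower.indexOf(phrase) !== -1) hits += 1;
    }
  }
  return hits;
}

// Inject the reminder only if the upgrade tool has not been called recently.
function shouldInjectEscalationHint(
  assistantMessages: string[],
  upgradeCalledRecently: boolean,
  minHits: number = 3
): boolean {
  return !upgradeCalledRecently && countStuckPhrases(assistantMessages) >= minHits;
}
```

Under these counting assumptions, a single message such as "let me try a different approach" already scores two hits ("try a different" and "different approach"), so the threshold of 3 can be reached within two messages.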

Passive — Tool Description

Self-awareness instruction

The tool description teaches the model to call the tool when it catches itself saying "Let me try a different approach" on the 2nd attempt at the same problem, and adds: "Calling this tool is not admitting defeat. It is correct resource allocation." The full text is in Appendix A.

Active — Phrase Scanner

Forced escalation hint

Background scanner (signals.ts) counts phrases → injects a <system-reminder> nudge → model calls UpgradeSessionModel({}) → one-way upgrade. The full text of the hint is in Appendix A.

The tool is dynamically hidden

When there's no upgrade path (model already at max tier, or target blocked by policy), UpgradeSessionModel is removed from the tool list entirely. The LLM literally cannot see it.
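One way to picture this filtering is below. The tool name is from the binary; the context shape (a nullable upgrade target plus a list of policy-blocked models) is an assumption made for illustration.

```typescript
interface ToolSpec {
  toolName: string;
}

interface UpgradeContext {
  upgradeTarget: string | null; // null when already at the top tier
  blockedModels: string[]; // models disallowed by org policy
}

// Drops UpgradeSessionModel from the tool list when no upgrade is possible.
function visibleTools(tools: ToolSpec[], ctx: UpgradeContext): ToolSpec[] {
  const canUpgrade =
    ctx.upgradeTarget !== null && ctx.blockedModels.indexOf(ctx.upgradeTarget) === -1;
  return canUpgrade ? tools : tools.filter((t) => t.toolName !== "UpgradeSessionModel");
}
```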

05 — Upgrade Execution

One-way escalation to the apex model

9 of 11 upgrade paths converge on claude-opus-4-7. There is no downgrade. Cross-provider switches trigger conversation compaction.

From → To · Cost jump

minimax-m2.7 (0.1x) → claude-opus-4-7 · 20x

kimi-k2.6 (0.4x) → claude-opus-4-7 · 5x

claude-sonnet-4-6 (1.2x) → claude-opus-4-7 · 1.67x

gpt-5.4-mini (0.3x) → gpt-5.4 · 3.3x

gemini-3-flash (0.2x) → gemini-3.1-pro · 4x
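The cost jump is the ratio of the target's cost multiplier to the source's. The sketch below checks the three opus rows, taking opus at 2.0x as shown in the end-to-end example; the multipliers for gpt-5.4 and gemini-3.1-pro are not stated directly, so those rows are left out. The table and function names are illustrative.

```typescript
const RELATIVE_COST: Record<string, number> = {
  "minimax-m2.7": 0.1,
  "kimi-k2.6": 0.4,
  "claude-sonnet-4-6": 1.2,
  "claude-opus-4-7": 2.0,
};

// Ratio of target cost to source cost, rounded to two decimals.
function costJump(from: string, to: string): number {
  const fromCost = RELATIVE_COST[from];
  const toCost = RELATIVE_COST[to];
  if (fromCost === undefined || toCost === undefined) {
    throw new Error("unknown model in cost table");
  }
  return Math.round((toCost / fromCost) * 100) / 100;
}
```

For example, `costJump("claude-sonnet-4-6", "claude-opus-4-7")` is 2.0 / 1.2, which rounds to the 1.67x listed above.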

Path A — Router Session

Soft switch

Session model stays "auto". Only effectiveFactoryRouterModel changes. No compaction if same provider family. Cached for all future turns.

Path B — Concrete Session

Hard switch

Session model permanently changes. Cross-provider switches (e.g. factory → anthropic) trigger conversation serialization and compaction. Provider lock updates.
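The two paths can be contrasted in a short sketch. The "router" and "concrete" values match the enum strings found in the binary; the session shape, the function name, and treating any provider change as the trigger for compaction are simplifying assumptions.

```typescript
type SessionKind = "router" | "concrete";

interface UpgradableSession {
  kind: SessionKind;
  sessionModel: string; // "auto" for router sessions
  effectiveFactoryRouterModel?: string;
  provider: string;
}

interface UpgradeOutcome {
  compacted: boolean; // true when the conversation had to be compacted
}

function applyUpgrade(
  session: UpgradableSession,
  targetModel: string,
  targetProvider: string
): UpgradeOutcome {
  const crossProvider = session.provider !== targetProvider;
  if (session.kind === "router") {
    // Path A (soft switch): session model stays "auto".
    session.effectiveFactoryRouterModel = targetModel;
  } else {
    // Path B (hard switch): the session model itself changes.
    session.sessionModel = targetModel;
  }
  session.provider = targetProvider; // provider lock follows the new model
  return { compacted: crossProvider };
}
```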

06 — End-to-End Example

A session from cheap to premium

Turn 1 · minimax-m2.7 · 0.1x

User: "fix the typo in README.md" → classifier scores minimax 0.85 → cheapest viable → task completed.

Turns 2–4 · minimax-m2.7 · 0.1x

Cache hits. No classifier calls. Simple follow-ups handled at minimum cost.

Turn 5 · minimax-m2.7 · 0.1x

User asks something harder. Minimax starts struggling: "let me try a different approach"

Turns 6–7 · stuck detection triggers

"actually, wait, that didn't work" → "let me rethink this" → 3+ stuck phrases detected → <system-reminder> escalation hint injected.

Turn 7 · upgrade → claude-opus-4-7 · 2.0x

Model calls UpgradeSessionModel({}). Cross-provider switch triggers conversation compaction. All subsequent turns use opus.

Turns 8+ · claude-opus-4-7 · 2.0x

Task completed successfully with the premium model. No further routing changes.

Turns at 0.1x cost

Savings on those turns: ~20x

Score threshold: 0.7

Classifier timeout: 10s

07 — Methodology

How this was extracted

The droid binary is a Bun single-file executable: the JS application is embedded as literal UTF-8 text. The code is minified but not obfuscated, so variable names are mangled while all string constants survive.

# 1. Find a string constant in the binary
grep -boa 'FACTORY_ROUTER' droid
# → 62886569:FACTORY_ROUTER

# 2. Extract surrounding JS at that byte offset
dd if=droid bs=1 skip=62886400 count=500 | strings -n 5
# → IR.FACTORY_ROUTER="auto"})(AE||={});
# → ((H)=>{H.Concrete="concrete";H.Router="router"})(yMT||={});

# 3. Sourcemap paths reveal original file structure
strings droid | grep 'model-router/' | sort -u
# → packages/droid-core/src/model-router/router.ts
# → packages/droid-core/src/model-router/signals.ts
# → packages/droid-core/src/model-router/selector.ts
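For readers who would rather script the search than use grep, the byte-offset lookup in step 1 can be approximated as below. This is an illustrative sketch, not part of the original methodology: the function names are invented, the search is a naive scan, and it assumes an ASCII needle. Loading the binary into a `Uint8Array` is left to the caller.

```typescript
// Returns every byte offset at which the ASCII string `needle` occurs.
function findByteOffsets(haystack: Uint8Array, needle: string): number[] {
  const offsets: number[] = [];
  if (needle.length === 0) return offsets;
  for (let i = 0; i + needle.length <= haystack.length; i++) {
    let matched = true;
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle.charCodeAt(j)) {
        matched = false;
        break;
      }
    }
    if (matched) offsets.push(i);
  }
  return offsets;
}

// Helper for small experiments: encodes an ASCII string as bytes.
function asciiBytes(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) bytes[i] = text.charCodeAt(i) & 0xff;
  return bytes;
}
```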
  

Preserved

What we can see

String literals, enum values, error messages, log statements, metric names, JSON schemas, prompt text, file paths.

Destroyed

What's lost

Variable names (mangled to T, R, H...), type annotations, comments, formatting. Sourcemaps have empty "names":[] arrays.

Reconstructed

What we inferred

Function purposes (from log messages), data flow (from error strings), architecture (from sourcemap paths + metric names).

Appendix A — Verbatim Prompts

Extracted prompt text

These are the exact prompt strings extracted from the binary.

UpgradeSessionModel Tool Description

Variable: frH · Binary offset: ~69676500 · The model reads this tool description to know when to self-escalate.

Switch this session to a more capable model. This affects all subsequent turns.
Call this tool when you catch yourself reasoning any of these phrases:
- "Let me try a different approach" (on the 2nd attempt at the same problem)
- "Actually, wait..." / "Hmm, let me..." after something didn't work
- "Let me reconsider" / "I need to rethink this"
- "This is too slow" / "This isn't working" / "more complicated than I thought"
- "Let me look at this from a different angle / completely different perspective"
- "I'm missing something fundamental"
- "I give up on [approach X]"
More generally: if you are on your second or third unsuccessful attempt, cycling approaches, or guessing without clear understanding of why a solution should work, upgrade.
Calling this tool is not admitting defeat. It is correct resource allocation -- continuing to thrash on a task your current model cannot solve wastes tokens and produces worse outcomes.

Escalation Hint (System Reminder Injection)

Variable: GBh · Binary offset: ~64768000 · Injected as a <system-reminder> when stuck detection triggers (≥3 stuck phrases in last 5 assistant messages).

You have recently expressed difficulty multiple times (phrases like "let me try a different approach", "actually, wait", "this isn't working"). The UpgradeSessionModel tool is available. Call it now if your current approach is not working, or continue only if you have a concrete new approach distinct from your prior attempts.

Appendix B — Model Capability Cards

Classifier input cards

These cards are fed verbatim to the classifier LLM (gpt-5.4-mini) to inform routing decisions. They contain real evaluation scores from Factory's benchmarks.

claude-opus-4-7 `(DO0)`

capabilities: images (basic support), tool calling, single-file edits, grep/read/test loops, checklist execution
strengths: [extracted from surrounding context — the card lists extensive strengths including]:
- Compiler internals, garbage collector debugging
- Cryptanalysis and security/exploit tasks
- Forensic git recovery
- Legacy binary format parsing (Java proprietary formats)
- Complex reasoning chains and iterative debugging
- COBOL business logic (partial — listed as weakness too)
weaknesses: Consistently fails on:
- COBOL business logic (compiles but wrong deductions)
- x86-64 assembly generation (fails completely)
- Fortran 77 + physics simulation (double weakness)
- Tight numerical tolerances (hyperparameter tuning, spectral fitting)
- HTML normalization traps (byte-identical output preservation)
- Video/media analysis (frame-level temporal detection)
score_examples:
- "Recover a deleted secret from repository history and scrub all refs" → 0.97
- "Fix a legacy Java binary format parser producing wrong financial totals" → 0.91
- "Fix a compiler's garbage collector after a storage format change" → 0.98
- "Implement a chosen-plaintext cryptanalytic attack to recover a cipher key" → 0.97
- "Fix a COBOL payroll system producing incorrect net payroll" → 0.15
- "Write x86-64 assembly for an industrial protocol register parser" → 0.10
- "Implement Monte Carlo particle transport in Fortran 77" → 0.05
- "Train a text classifier to a specific accuracy threshold" → 0.25
- "Strip scripting from markup while preserving untouched files byte-identical" → 0.10
- "Extract temporal metrics from a video recording" → 0.15
- "Write a polyglot file valid in two different languages" → 0.45
- "Build a gRPC server with standard CRUD operations" → 0.98
- "Find a valid time slot satisfying multiple calendar constraints" → 0.98

kimi-k2.6 `(oO0)`

capabilities: images (basic support), tool calling, single-file edits, grep/read/test loops, checklist execution
strengths: Fast and effective on well-scoped tasks with clear, conventional solution paths:
- Standard server/infra: gRPC servers, Nginx, OpenSSL certs, package hosting
- Git operations: recovering lost changes, leaked secrets from history
- Standard ML ops: model inference, PyTorch CLI, MCMC/Stan sampling
- Data processing: log summarization, multi-source merging, CSV transforms
- Code migration: Python 2→3, standard COBOL modernization
- Formal proofs: well-known patterns (e.g. commutativity in Coq)
- Constraint satisfaction: scheduling, portfolio optimization
weaknesses: Consistently fails on tasks requiring sustained multi-step reasoning, iterative debugging, or deep domain expertise. Specific failure areas:
- COBOL/mainframe: EBCDIC encoding, VSAM handling, complex financial calculations
- Java 7 proprietary binary formats: packed decimals, CDR processing, telecom protocols
- x86-64 assembly: protocol parsers, hardware validators
- Fortran scientific computing: nuclear physics, Monte Carlo methods, modal analysis
- C89 systems programming: MLFQ schedulers, binary codecs, transport protocols
- Security/exploit tasks: XSS bypass, cryptanalysis, exploit development
- Domain-specific science: DNA primer design, Raman spectroscopy, protein FRET, cell segmentation
- Graphics/ray tracing: pixel-accurate path tracing, gcode interpretation
- Complex build systems: niche compiler builds (CompCert, pMARS), cross-compilation
- PyTorch distributed: pipeline parallelism, tensor parallelism
- Polyglot files: writing single files valid in two languages
score_examples:
- "Summarize log files by date range into a CSV" → 0.95
- "Build a gRPC server with standard CRUD operations" → 0.95
- "Configure a web server with custom request logging" → 0.85
- "Install RStan and sample a hierarchical Bayesian model" → 0.85
- "Find a valid time slot satisfying multiple constraints" → 0.80
- "Recover a secret from rewritten git history" → 0.80
- "Reconstruct a PyTorch model architecture from a state dict" → 0.75
- "Implement a statistical sampling algorithm from a paper" → 0.50
- "Build native extensions for a Python package" → 0.45
- "Remove secrets from a git repo across all history" → 0.35
- "Fix an OCaml garbage collector after a storage format change" → 0.35
- "Fix a legacy Java binary format parser producing wrong financial totals" → 0.15
- "Implement a chosen-plaintext cryptanalytic attack" → 0.10
- "Design molecular biology primers for gene insertion" → 0.05
- "Write x86-64 assembly for an industrial protocol parser" → 0.05

minimax-m2.7 `(qO0)`

capabilities: images (not supported — score 0.0 on any task with images), tool calling, single-file edits, grep/read/test loops, checklist execution
strengths: Fast and effective on well-scoped single-file tasks with clear instructions, standard test loops, and short Q&A.
weaknesses: Cannot process images at all. Unreliable on multi-file refactors, subtle debugging, long reasoning chains, niche toolchains, and tasks requiring deep domain knowledge. Similar weakness profile to other non-flagship models — see specific struggle areas for comparable models.

