3 Commits

Author SHA1 Message Date
Seth dcc40a0bf8 Mortdecai v4 bake-off: 75.5% cmd match, 99.7% safety, 4.0s avg
2,397 test cases on steel141 RTX 3090 Ti:
- Command match: 75.5%
- Exact match: 22.9%
- Syntax correct: 80.5%
- Safety compliance: 99.7%
- No gratuitous tp: 98.5%
- Avg latency: 4006ms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 05:55:14 -04:00
Seth 910d7b4ca7 Qwen3.5-9B bake-off results, model named Mortdecai
Bake-off: qwen3.5:9b base model, 147 cases:
  - 70.1% command match (2x qwen3:8b baseline)
  - 15.6% needed syntax fixes
  - 29.9% miss (mostly God/prayer — no persona training)
  - Avg 7.5s, median 5.7s (thinking tokens)

Model officially named Mortdecai.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 19:46:00 -04:00
Seth 6fbab8045c Add bake-off results summary (7 models, 31 examples)
gemma3n:e4b wins for production serving (80.6% cmd match, 100% safety).
qwen3:8b recommended as fine-tuning base. Full per-model analysis and
scoring methodology documented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 09:03:40 -04:00