0.5.0 bake-off results, knowledge lookup tools, training progress chart

Bake-off (0.5.0 vs 0.4.0): - Overall: 46.8% vs 45.2% (+1.6%), 0 errors vs 2 - Enchantments: +47% (20% → 67%) - EssentialsX: +60% (0% → 60%) - Effects: +25% (0% → 25%) - Regressions: fill_build -67%, world -20% Knowledge Lookup Tools (4 new): - plugin.docs_lookup: WorldGuard, WorldEdit, CoreProtect, EssentialsX, LuckPerms docs - minecraft.changelog_lookup: version history from Minecraft Wiki - paper.docs_lookup: Paper server-specific documentation - Wired into gateway model-driven tool loop and exploration self-play Exploration Self-Play: - General (vanilla MC) and plugins focus modes - Wiki-grounded: model researches before acting, validates through RCON - 2,243 exploration examples generated, 150 kept after quality filtering Training Progress Chart: - SVG chart showing training examples and inverse loss across versions - Added to MODEL_CARD.md for Gitea display Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 15:28:09 -04:00
parent da8f557219
commit f5118505b1
10 changed files with 3215 additions and 20 deletions
@@ -1,17 +1,19 @@
 # Model Card: Mortdecai

+![Training Progress](branding/training_progress.svg)
+
 ## Model Details

 | Field | Value |
 |-------|-------|
 | **Name** | Mortdecai |
-| **Version** | 0.4.0 |
+| **Version** | 0.5.0 |
 | **Base Model** | Qwen3.5-9B (Apache 2.0) |
 | **Adaptation** | QLoRA (4-bit base + LoRA adapters in FP16) |
 | **Parameters** | 9.4B total, 29M trainable (0.31%) |
 | **Training Hardware** | RTX 3090 Ti (24GB VRAM) |
-| **Inference Hardware** | RTX 4000 (16GB), RTX 2080 Ti (11GB), or any GPU with 6GB+ VRAM |
-| **Quantization** | Q4_K_M (5.3GB GGUF) |
+| **Inference Hardware** | RTX 4000 (16GB), RTX 2080 Ti (11GB), GTX 1660 Super (6GB), or any GPU with 6GB+ |
+| **Quantization** | Q4_K_M (5.6GB GGUF) |
 | **Context Length** | 4096 tokens (training), 262K tokens (model capability) |
 | **License** | Proprietary (adapter + training data). Base model: Apache 2.0 |

@@ -34,15 +36,25 @@ Mortdecai is designed for **Minecraft Java Edition 1.21.x server operations**:

 | Source | Count | Description |
 |--------|-------|-------------|
-| Hand-curated examples | 966 | Command syntax, recipes, enchantments, entities, effects |
-| Player interactions | 654 | Real prayers from live server players |
-| Sudo translations | 525 | Natural language → command pairs |
-| Tool-calling sequences | 1,159 | Multi-turn RCON execution with error correction |
-| Self-play | 5,000+ | Model-generated prompts validated via RCON |
-| API distillation | 344 | Claude Haiku gold-standard responses |
-| Error corrections | 150+ | Wrong → right command pairs |
+| Hand-curated seed examples | 3,196 | Command syntax, recipes, enchantments, entities, effects, memory, events |
+| Tool-calling sequences | 1,430 | Multi-turn RCON execution with 17 tools (script, memory, wiki, plugins) |
+| IGLU build dataset | 4,656 | Natural language → block placement commands from Microsoft Research |
+| Plugin training (RCON-validated) | 104 | WorldGuard, CoreProtect, EssentialsX, LuckPerms, FAWE |
+| Exploration self-play | 150 | Wiki-grounded knowledge discovery with RCON validation |
+| Self-play (0.4.0 + 0.5.0) | 2,900+ | Model-generated prompts validated via RCON |
+| Live server audit | 8,000+ | Wolf bot + real player interactions from 3 servers |

-**Total: ~8,400+ examples**
+**Total: ~20,000+ examples across all sources**
+
+### Tool Architecture (17 tools)
+
+| Category | Tools |
+|----------|-------|
+| Execution | rcon.execute |
+| Knowledge | minecraft.wiki_lookup, plugin.docs_lookup, minecraft.changelog_lookup, paper.docs_lookup |
+| World Sensing | world.player_info, world.server_state, world.nearby_entities |
+| Memory | memory.read, memory.write |
+| Scripts | script.write, script.validate, script.execute, script.read, script.list, script.delete, script.schedule |

 ### Data Collection Methods

@@ -63,16 +75,25 @@ Mortdecai is designed for **Minecraft Java Edition 1.21.x server operations**:

 ## Evaluation

-### Bake-off Results (0.4.0, 2,397 test cases)
+### Bake-off Results (0.5.0 vs 0.4.0, 38 prompts × 12 categories)

-| Metric | Score |
-|--------|-------|
-| Command match | 75.5% |
-| Exact match | 22.9% |
-| Syntax correct | 80.5% |
-| Safety compliance | 99.7% |
-| No gratuitous tp | 98.5% |
-| Avg latency | 4.0s |
+| Metric | 0.4.0 | 0.5.0 |
+|--------|-------|-------|
+| Overall success rate | 45.2% | 46.8% |
+| Avg response time | 2.60s | 2.11s |
+| Errors (crashes) | 2 | 0 |
+| Empty responses | 0 | 0 |
+
+**Category improvements (0.5.0 vs 0.4.0):**
+
+| Category | 0.4.0 | 0.5.0 | Change |
+|----------|-------|-------|--------|
+| Enchantments | 20% | 67% | **+47%** |
+| EssentialsX | 0% | 60% | **+60%** |
+| Effects | 0% | 25% | **+25%** |
+| Basic commands | 75% | 75% | — |
+| Teleport | 100% | 100% | — |
+| Error recovery | 50% | 50% | — |

 ### Safety