0.5.0 bake-off results, knowledge lookup tools, training progress chart

Bake-off (0.5.0 vs 0.4.0):
- Overall: 46.8% vs 45.2% (+1.6%), 0 errors vs 2
- Enchantments: +47% (20% → 67%)
- EssentialsX: +60% (0% → 60%)
- Effects: +25% (0% → 25%)
- Regressions: fill_build -67%, world -20%

Knowledge Lookup Tools (4 new):
- plugin.docs_lookup: WorldGuard, WorldEdit, CoreProtect, EssentialsX, LuckPerms docs
- minecraft.changelog_lookup: version history from Minecraft Wiki
- paper.docs_lookup: Paper server-specific documentation
- Wired into gateway model-driven tool loop and exploration self-play

Exploration Self-Play:
- General (vanilla MC) and plugins focus modes
- Wiki-grounded: model researches before acting, validates through RCON
- 2,243 exploration examples generated, 150 kept after quality filtering

Training Progress Chart:
- SVG chart showing training examples and inverse loss across versions
- Added to MODEL_CARD.md for Gitea display

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Mortdecai
2026-03-21 15:28:09 -04:00
parent da8f557219
commit f5118505b1
10 changed files with 3215 additions and 20 deletions
+41 -20
View File
@@ -1,17 +1,19 @@
# Model Card: Mortdecai
![Training Progress](branding/training_progress.svg)
## Model Details
| Field | Value |
|-------|-------|
| **Name** | Mortdecai |
| **Version** | 0.4.0 |
| **Version** | 0.5.0 |
| **Base Model** | Qwen3.5-9B (Apache 2.0) |
| **Adaptation** | QLoRA (4-bit base + LoRA adapters in FP16) |
| **Parameters** | 9.4B total, 29M trainable (0.31%) |
| **Training Hardware** | RTX 3090 Ti (24GB VRAM) |
| **Inference Hardware** | RTX 4000 (16GB), RTX 2080 Ti (11GB), or any GPU with 6GB+ VRAM |
| **Quantization** | Q4_K_M (5.3GB GGUF) |
| **Inference Hardware** | RTX 4000 (16GB), RTX 2080 Ti (11GB), GTX 1660 Super (6GB), or any GPU with 6GB+ |
| **Quantization** | Q4_K_M (5.6GB GGUF) |
| **Context Length** | 4096 tokens (training), 262K tokens (model capability) |
| **License** | Proprietary (adapter + training data). Base model: Apache 2.0 |
@@ -34,15 +36,25 @@ Mortdecai is designed for **Minecraft Java Edition 1.21.x server operations**:
| Source | Count | Description |
|--------|-------|-------------|
| Hand-curated examples | 966 | Command syntax, recipes, enchantments, entities, effects |
| Player interactions | 654 | Real prayers from live server players |
| Sudo translations | 525 | Natural language → command pairs |
| Tool-calling sequences | 1,159 | Multi-turn RCON execution with error correction |
| Self-play | 5,000+ | Model-generated prompts validated via RCON |
| API distillation | 344 | Claude Haiku gold-standard responses |
| Error corrections | 150+ | Wrong → right command pairs |
| Hand-curated seed examples | 3,196 | Command syntax, recipes, enchantments, entities, effects, memory, events |
| Tool-calling sequences | 1,430 | Multi-turn RCON execution with 17 tools (script, memory, wiki, plugins) |
| IGLU build dataset | 4,656 | Natural language → block placement commands from Microsoft Research |
| Plugin training (RCON-validated) | 104 | WorldGuard, CoreProtect, EssentialsX, LuckPerms, FAWE |
| Exploration self-play | 150 | Wiki-grounded knowledge discovery with RCON validation |
| Self-play (0.4.0 + 0.5.0) | 2,900+ | Model-generated prompts validated via RCON |
| Live server audit | 8,000+ | Wolf bot + real player interactions from 3 servers |
**Total: ~8,400+ examples**
**Total: ~20,000+ examples across all sources**
### Tool Architecture (17 tools)
| Category | Tools |
|----------|-------|
| Execution | rcon.execute |
| Knowledge | minecraft.wiki_lookup, plugin.docs_lookup, minecraft.changelog_lookup, paper.docs_lookup |
| World Sensing | world.player_info, world.server_state, world.nearby_entities |
| Memory | memory.read, memory.write |
| Scripts | script.write, script.validate, script.execute, script.read, script.list, script.delete, script.schedule |
### Data Collection Methods
@@ -63,16 +75,25 @@ Mortdecai is designed for **Minecraft Java Edition 1.21.x server operations**:
## Evaluation
### Bake-off Results (0.4.0, 2,397 test cases)
### Bake-off Results (0.5.0 vs 0.4.0, 38 prompts × 12 categories)
| Metric | Score |
|--------|-------|
| Command match | 75.5% |
| Exact match | 22.9% |
| Syntax correct | 80.5% |
| Safety compliance | 99.7% |
| No gratuitous tp | 98.5% |
| Avg latency | 4.0s |
| Metric | 0.4.0 | 0.5.0 |
|--------|-------|-------|
| Overall success rate | 45.2% | 46.8% |
| Avg response time | 2.60s | 2.11s |
| Errors (crashes) | 2 | 0 |
| Empty responses | 0 | 0 |
**Category improvements (0.5.0 vs 0.4.0):**
| Category | 0.4.0 | 0.5.0 | Change |
|----------|-------|-------|--------|
| Enchantments | 20% | 67% | **+47%** |
| EssentialsX | 0% | 60% | **+60%** |
| Effects | 0% | 25% | **+25%** |
| Basic commands | 75% | 75% | — |
| Teleport | 100% | 100% | — |
| Error recovery | 50% | 50% | — |
### Safety