f5118505b1
Bake-off (0.5.0 vs 0.4.0):
- Overall: 46.8% vs 45.2% (+1.6%), 0 errors vs 2
- Enchantments: +47% (20% → 67%)
- EssentialsX: +60% (0% → 60%)
- Effects: +25% (0% → 25%)
- Regressions: fill_build -67%, world -20%

Knowledge Lookup Tools (4 new):
- plugin.docs_lookup: WorldGuard, WorldEdit, CoreProtect, EssentialsX, LuckPerms docs
- minecraft.changelog_lookup: version history from Minecraft Wiki
- paper.docs_lookup: Paper server-specific documentation
- Wired into gateway model-driven tool loop and exploration self-play

Exploration Self-Play:
- General (vanilla MC) and plugins focus modes
- Wiki-grounded: model researches before acting, validates through RCON
- 2,243 exploration examples generated, 150 kept after quality filtering

Training Progress Chart:
- SVG chart showing training examples and inverse loss across versions
- Added to MODEL_CARD.md for Gitea display

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
128 lines
5.1 KiB
Markdown
# Model Card: Mortdecai

## Model Details

| Field | Value |
|-------|-------|
| **Name** | Mortdecai |
| **Version** | 0.5.0 |
| **Base Model** | Qwen3.5-9B (Apache 2.0) |
| **Adaptation** | QLoRA (4-bit base + LoRA adapters in FP16) |
| **Parameters** | 9.4B total, 29M trainable (0.31%) |
| **Training Hardware** | RTX 3090 Ti (24GB VRAM) |
| **Inference Hardware** | RTX 4000 (16GB), RTX 2080 Ti (11GB), GTX 1660 Super (6GB), or any GPU with 6GB+ |
| **Quantization** | Q4_K_M (5.6GB GGUF) |
| **Context Length** | 4096 tokens (training), 262K tokens (model capability) |
| **License** | Proprietary (adapter + training data). Base model: Apache 2.0 |

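The parameter and quantization figures in the table can be cross-checked with a few lines of arithmetic. The sketch below assumes "GB" means 10^9 bytes and that the GGUF file is dominated by weights. Both are assumptions, so the bits-per-weight value is only a rough estimate.

```python
# Figures copied from the Model Details table.
TOTAL_PARAMS = 9.4e9        # 9.4B total parameters
TRAINABLE_PARAMS = 29e6     # 29M trainable LoRA parameters
GGUF_SIZE_BYTES = 5.6e9     # 5.6GB, assuming 1 GB = 10**9 bytes


def trainable_percent(trainable: float, total: float) -> float:
    """Share of parameters updated during fine-tuning, in percent."""
    return 100.0 * trainable / total


def bits_per_weight(file_bytes: float, params: float) -> float:
    """Approximate storage cost per parameter, ignoring file metadata."""
    return 8.0 * file_bytes / params


print(round(trainable_percent(TRAINABLE_PARAMS, TOTAL_PARAMS), 2))
print(round(bits_per_weight(GGUF_SIZE_BYTES, TOTAL_PARAMS), 2))
```

The first value reproduces the 0.31% quoted in the table. The second lands a little under 5 bits per weight, which is plausible for a 4-bit mixed quantization scheme.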
## Intended Use

Mortdecai is designed for **Minecraft Java Edition 1.21.x server operations**:

- Translating natural language to valid Minecraft commands
- Controlling an AI God character that responds to player prayers
- Server administration via chat (gamerules, effects, world editing)
- Error correction (self-corrects failed RCON commands)

**Not intended for:**

- General-purpose chat or reasoning
- Other games or non-Minecraft domains
- Safety-critical applications
- Use without the validator safety layer

## Training Data

| Source | Count | Description |
|--------|-------|-------------|
| Hand-curated seed examples | 3,196 | Command syntax, recipes, enchantments, entities, effects, memory, events |
| Tool-calling sequences | 1,430 | Multi-turn RCON execution with 17 tools (script, memory, wiki, plugins) |
| IGLU build dataset | 4,656 | Natural language → block placement commands from Microsoft Research |
| Plugin training (RCON-validated) | 104 | WorldGuard, CoreProtect, EssentialsX, LuckPerms, FAWE |
| Exploration self-play | 150 | Wiki-grounded knowledge discovery with RCON validation |
| Self-play (0.4.0 + 0.5.0) | 2,900+ | Model-generated prompts validated via RCON |
| Live server audit | 8,000+ | Wolf bot + real player interactions from 3 servers |

**Total: ~20,000+ examples across all sources**

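As a sanity check on the stated total, the per-source counts can be summed directly. This sketch treats the "2,900+" and "8,000+" entries as lower bounds, which is an assumption; the true total may be somewhat higher.

```python
# Per-source example counts copied from the Training Data table.
# Entries listed with a trailing "+" are taken at their lower bound (assumption).
SOURCE_COUNTS = {
    "Hand-curated seed examples": 3196,
    "Tool-calling sequences": 1430,
    "IGLU build dataset": 4656,
    "Plugin training (RCON-validated)": 104,
    "Exploration self-play": 150,
    "Self-play (0.4.0 + 0.5.0)": 2900,
    "Live server audit": 8000,
}


def total_examples(counts: dict) -> int:
    """Sum the example counts across all sources."""
    return sum(counts.values())


print(total_examples(SOURCE_COUNTS))  # 20436
```

The lower-bound sum of 20,436 is consistent with the "20,000+" figure.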
### Tool Architecture (17 tools)

| Category | Tools |
|----------|-------|
| Execution | rcon.execute |
| Knowledge | minecraft.wiki_lookup, plugin.docs_lookup, minecraft.changelog_lookup, paper.docs_lookup |
| World Sensing | world.player_info, world.server_state, world.nearby_entities |
| Memory | memory.read, memory.write |
| Scripts | script.write, script.validate, script.execute, script.read, script.list, script.delete, script.schedule |

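The tool table can be expressed as a small registry, which makes the count of 17 easy to verify and gives a gateway a place to look up a tool's category. The tool names are taken from the table. The dictionary layout and the helper functions `all_tools` and `category_of` are hypothetical and are not the project's actual code.

```python
from typing import Optional

# Tool names copied from the Tool Architecture table; the structure is illustrative.
TOOLS_BY_CATEGORY = {
    "Execution": ["rcon.execute"],
    "Knowledge": [
        "minecraft.wiki_lookup",
        "plugin.docs_lookup",
        "minecraft.changelog_lookup",
        "paper.docs_lookup",
    ],
    "World Sensing": [
        "world.player_info",
        "world.server_state",
        "world.nearby_entities",
    ],
    "Memory": ["memory.read", "memory.write"],
    "Scripts": [
        "script.write",
        "script.validate",
        "script.execute",
        "script.read",
        "script.list",
        "script.delete",
        "script.schedule",
    ],
}


def all_tools(registry: dict) -> list:
    """Flatten the registry into a single list of tool names."""
    return [tool for tools in registry.values() for tool in tools]


def category_of(tool_name: str, registry: dict) -> Optional[str]:
    """Return the category a tool belongs to, or None if it is unknown."""
    for category, tools in registry.items():
        if tool_name in tools:
            return category
    return None


print(len(all_tools(TOOLS_BY_CATEGORY)))  # 17
```

Returning `None` for unknown names lets a caller reject tool calls the model hallucinates instead of dispatching them.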
### Data Collection Methods

1. **Manual curation**: Minecraft Wiki, command reference, recipe databases
2. **Live server logs**: Real player interactions on Paper 1.21.x servers
3. **Bot collection**: Mineflayer bots with Gemini/Dolphin prompt generation
4. **API distillation**: Claude Haiku and Gemini Flash responses
5. **Self-play**: Model generates edge cases, attempts via RCON, learns from results
6. **RCON validation**: Every command tested against a live Minecraft server

### Known Biases

- Training data is skewed toward English (~97%), with limited multilingual coverage (~3%)
- Command distribution favors `give` and `effect` over complex `execute` chains
- God persona training reflects a specific dramatic character, not a neutral one
- Player interaction data comes from a small group of testers (fewer than 10 players)
- Self-play data may overrepresent patterns the model is already good at

## Evaluation

### Bake-off Results (0.5.0 vs 0.4.0, 38 prompts × 12 categories)

| Metric | 0.4.0 | 0.5.0 |
|--------|-------|-------|
| Overall success rate | 45.2% | 46.8% |
| Avg response time | 2.60s | 2.11s |
| Errors (crashes) | 2 | 0 |
| Empty responses | 0 | 0 |

**Category improvements (0.5.0 vs 0.4.0):**

| Category | 0.4.0 | 0.5.0 | Change (percentage points) |
|----------|-------|-------|--------|
| Enchantments | 20% | 67% | **+47 pp** |
| EssentialsX | 0% | 60% | **+60 pp** |
| Effects | 0% | 25% | **+25 pp** |
| Basic commands | 75% | 75% | 0 pp |
| Teleport | 100% | 100% | 0 pp |
| Error recovery | 50% | 50% | 0 pp |

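The per-category changes are differences between two success rates, so they are percentage points rather than relative growth. A minimal sketch of that calculation follows, using the rates from the table; the names `CATEGORY_RATES` and `point_change` are illustrative only.

```python
# (0.4.0 rate, 0.5.0 rate) in percent, copied from the category table.
CATEGORY_RATES = {
    "Enchantments": (20, 67),
    "EssentialsX": (0, 60),
    "Effects": (0, 25),
    "Basic commands": (75, 75),
    "Teleport": (100, 100),
    "Error recovery": (50, 50),
}


def point_change(before: int, after: int) -> int:
    """Change in success rate, in percentage points."""
    return after - before


changes = {
    name: point_change(before, after)
    for name, (before, after) in CATEGORY_RATES.items()
}

for name, delta in changes.items():
    print(f"{name}: {delta:+d} pp")
```

Relative change is undefined for categories that started at 0% (EssentialsX and Effects), which is one reason to report points instead.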
### Safety

The model uses a 5-level risk hierarchy:

- **Level 0 (never):** ban, kick, stop, op (hardcoded block in validator)
- **Level 1 (refuse):** permanent server state changes
- **Level 2 (warn):** temporary/reversible changes, destructive actions
- **Level 3 (normal):** standard gameplay commands
- **Level 4 (generous):** full enchanted gear, large material stacks

Additional safety layers:

- Validator blocks dangerous commands even if model generates them
- Dangerous effect duration caps (levitation 15s, wither 30s)
- Fall protection (detects lethal teleports)
- Gamerule auto-revert timers

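Two of the rules above, the Level 0 hard block and the effect duration caps, can be sketched as a small pre-execution check. This is an illustration under stated assumptions, not the project's validator: the function name `validate_command` and its return shape are hypothetical, and only the blocked command names and the two caps come from the list above.

```python
# Level 0 commands and duration caps are taken from the Safety section.
BLOCKED_COMMANDS = {"ban", "kick", "stop", "op"}
EFFECT_DURATION_CAPS = {"levitation": 15, "wither": 30}  # seconds


def validate_command(command: str):
    """Return (allowed, command), where command may be rewritten.

    Hypothetical helper: it only inspects the first token, so commands
    nested inside e.g. `execute ... run` are NOT caught by this sketch.
    The returned command has any leading slash removed.
    """
    tokens = command.strip().lstrip("/").split()
    if not tokens:
        return False, ""

    name = tokens[0].lower()
    if name in BLOCKED_COMMANDS:
        return False, ""

    # Expected form: effect give <targets> <effect> [<seconds>|infinite] ...
    if name == "effect" and len(tokens) >= 4 and tokens[1].lower() == "give":
        effect_id = tokens[3].lower().split(":")[-1]  # drop "minecraft:" prefix
        cap = EFFECT_DURATION_CAPS.get(effect_id)
        if cap is not None:
            if len(tokens) == 4:
                # No duration given: set it explicitly to the cap.
                tokens.append(str(cap))
            elif not tokens[4].isdigit() or int(tokens[4]) > cap:
                # "infinite" or an over-long duration: clamp to the cap.
                tokens[4] = str(cap)

    return True, " ".join(tokens)


print(validate_command("/op Steve"))
print(validate_command("effect give @p minecraft:levitation 60 1"))
```

Clamping an over-long duration, rather than rejecting the command outright, keeps the player's request working while bounding the harm. A real validator would also need to handle nested commands, fall protection, and gamerule timers.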
### Limitations

- Cannot determine what a player is looking at (no raycast)
- Limited awareness of world state beyond player position
- Enchantment syntax errors still occur (~15% need validator fixes)
- Empty responses on ~5% of requests
- Emits its reasoning in `<think>` blocks that must be stripped before the output is used (Qwen3 behavior)
- God persona can be unpredictable by design

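Stripping the `<think>` blocks is a simple post-processing step. The sketch below uses a regular expression. The function name `strip_think` is hypothetical, and dropping an unterminated block (for example, when output is cut off by the token limit) is an assumption about the desired behavior.

```python
import re

# Non-greedy match so multiple <think> blocks are removed independently.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# An opening tag with no closing tag: drop everything from the tag onward.
_UNCLOSED_THINK = re.compile(r"<think>.*\Z", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from model output."""
    cleaned = _THINK_BLOCK.sub("", text)
    cleaned = _UNCLOSED_THINK.sub("", cleaned)
    return cleaned.strip()


raw = "<think>The player wants speed.</think>\neffect give @p speed 60"
print(strip_think(raw))  # effect give @p speed 60
```

Output that was cut off inside a `<think>` block comes back as an empty string, so the caller can treat it as an empty response and retry.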
## Environmental Impact

- **Training energy:** ~84W × 4 hours = 0.34 kWh per training run
- **Inference energy:** ~54W during calls, idle otherwise
- **All compute on consumer GPUs:** no data center resources used

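The training-energy figure is average power multiplied by run time. A short worked check follows, with the wattage and duration taken from the list above and a hypothetical helper name, `energy_kwh`.

```python
def energy_kwh(power_watts: float, hours: float) -> float:
    """Energy in kilowatt-hours for a constant average power draw."""
    return power_watts * hours / 1000.0


training_run = energy_kwh(84, 4)
print(f"{training_run:.2f} kWh")  # 0.34 kWh
```

The exact product is 0.336 kWh, which rounds to the 0.34 kWh quoted above. The same helper applies to inference if the total time spent in calls is known.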