PLAN.md complete rewrite — Mortdecai project status, TODOs, risk hierarchy
Full rewrite reflecting current state:
- Model history v1→v4, infrastructure map, API spend
- Training data breakdown (3,477 total examples)
- Active TODOs: immediate, short-term, v5, infrastructure, community
- Risk hierarchy with permanence-based levels
- Key architecture decisions log
- Success criteria: v3 actual → v4 target → v5 goal
- Single-call enabled on prod (mortdecai-v3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -1,445 +1,181 @@
|
||||
# PLAN.md -- Project Roadmap (Live Document)
|
||||
# PLAN.md — Mortdecai Project Roadmap
|
||||
|
||||
> **Last updated:** 2026-03-18 (rev 2)
|
||||
> **Last updated:** 2026-03-20
|
||||
> **Model name:** Mortdecai
|
||||
> **Domain:** mortdec.ai
|
||||
> **Status legend:** `[ ]` planned | `[~]` in progress | `[x]` done | `[-]` cancelled/deferred
|
||||
|
||||
---
|
||||
|
||||
## 0. Vision
|
||||
## Vision
|
||||
|
||||
Build a lightweight, Minecraft-focused AI assistant by adapting `qwen3-coder` (LoRA/SFT). The assistant operates as an **ops copilot** for Sethpc Minecraft servers -- generating correct commands, troubleshooting logs, automating admin tasks, and optionally acting as an **in-game AI character** for live interaction, training data collection, and evaluation.
|
||||
**Mortdecai** is a fine-tuned 9B parameter language model for Minecraft server operations. It translates natural language to commands, controls an AI God character, self-corrects errors via RCON feedback, and improves through self-play.
|
||||
|
||||
This is **not** a gameplay agent (like Voyager/MineDojo). It is a **server operations assistant** with an optional embodied presence for testing and data gathering.
|
||||
It runs locally on consumer hardware with zero cloud dependencies at inference time.
|
||||
|
||||
---
|
||||
|
||||
## 1. Prior Art & Inspirations
|
||||
## Current State (2026-03-20)
|
||||
|
||||
These projects informed the plan but solve different problems:
|
||||
### Models
|
||||
| Model | Base | Examples | Loss | Status |
|-------|------|---------|------|--------|
| v1 | Qwen3-8B | 233 | 0.10 | Retired (overfit) |
| v2 | Qwen3-8B | 361 | 2.03 | Retired |
| v3 | Qwen3-8B | 1,308 | 0.55 | **Deployed on prod** |
| **v4** | **Qwen3.5-9B** | **3,369** | **Training (~88%)** | ETA ~30 min |
|
||||
|
||||
| Project | What it does | What we borrow |
|
||||
|---------|-------------|----------------|
|
||||
| **Voyager** (6.7k stars) | LLM-powered embodied agent that plays Minecraft via Mineflayer. Skill library + auto-curriculum + iterative prompting. | Skill library concept (reusable verified command sequences). Iterative self-verification loop for command correctness. |
|
||||
| **MineDojo** (2.2k stars) | RL/LLM research framework with 3142 tasks. Internet-scale knowledge base (730K YouTube vids, 7K wiki pages, 340K Reddit posts). | Knowledge corpus pipeline -- scraping wiki.vg and Minecraft Wiki for command syntax reference data. Task-based evaluation structure. |
|
||||
| **Mindcraft** (4.9k stars) | LLM + Mineflayer in-game bots with profiles, multi-agent collab. Supports Ollama, many APIs. | Profile-based bot architecture. In-game chat integration pattern. Ollama local model support. Provides own fine-tuned models (`sweaterdog/andy-4`). |
|
||||
| **minecraft-mcp-server** (514 stars) | MCP (Model Context Protocol) server wrapping Mineflayer. Lets Claude/LLMs control a Minecraft character via tool calls. | MCP tool-call interface for in-game actions. Could be adapted for our eval harness. |
|
||||
| **Mineflayer** (6.7k stars) | Node.js Minecraft bot framework. Supports 1.8-1.21.11. Movement, inventory, chat, block interaction. | Primary framework for in-game AI character. Mature, well-maintained, 1.21 support confirmed. |
|
||||
| **Existing AI God system** (our own) | Log-tail + RCON + Ollama pipeline. `pray` trigger, divine intervention, command validation, syntax repair. Vanilla + Paper fork. | Direct predecessor. Baseline to measure against. Source of real training data (prayer logs, bug reports). |
|
||||
### Infrastructure
|
||||
| Component | Location | Details |
|-----------|----------|---------|
| Training GPU | steel141 RTX 3090 Ti (24GB) | QLoRA via Unsloth |
| Prod inference | node-197 RTX 4000 (16GB) | Ollama, mortdecai-v3 |
| Minecraft servers | CT 644 on node-112 | paper-ai (25567), shrink-world (25566), dev (25568), vanilla (25565) |
| Dev data collection | CT 644 | Gemini 2.5 Flash via API, 10 bots |
| Whitelist app | CT 644 port 8099 | minecraft.mortdec.ai |
| Caddy proxy | CT 600 on node-241 | mortdec.ai, minecraft.mortdec.ai |
| GPU monitoring | Grafana on CT 300 (node-173) | Prometheus + nvidia exporter on steel141 |
| LangGraph gateway | CT 644 port 8091 | Disabled on prod (fresh session mode available) |
|
||||
|
||||
### API Spend
|
||||
| Provider | Spent | Budget | Status |
|----------|-------|--------|--------|
| Claude Haiku | $20.01 | $20 | Exhausted |
| Gemini 2.5 Flash | ~$0.50 | $20 | Active (dev bots) |
|
||||
|
||||
### Training Data: 2,318 seed + 1,159 tool-calling = 3,477 total
|
||||
| Category | Count |
|----------|-------|
| Command syntax reference | 107 |
| Crafting recipes & chains | 176 |
| Enchantments (mutual exclusions, max levels) | 60 |
| Entities/mobs (summon, kill, NBT) | 60 |
| Execute chains | 45 |
| Multiplayer (selectors, teams, scoreboards) | 45 |
| Advanced commands (tellraw, clone, data, ride) | 45 |
| WorldEdit | 45 |
| Paper server features | 55 |
| Cosmetics/XP/effects | 42 |
| Gamerules | 49 |
| Risk hierarchy (L0-L3, prompt injection) | 40 |
| Quantity boundaries | 32 |
| Dangerous effect caps | 12 |
| Revert-aware gamerules + drops | 20 |
| Error correction pairs | 47 |
| Claude-distilled outputs | 344 |
| Bot audit interactions | 448+ |
| Boundary/safety examples | 95+ |
| Tool-calling (multi-turn with RCON) | 1,159 |
|
||||
|
||||
---
|
||||
|
||||
## 2. Architecture Overview
|
||||
## Active TODOs
|
||||
|
||||
```
          +---------------------+
          |  Minecraft Server   |
          |  (CT 644, 1.21.x)   |
          +----+----------+-----+
               |          |
          RCON |          | Protocol (Mineflayer)
               |          |
     +---------+--+   +---+------------+
     | Ops Layer  |   | In-Game Agent  |
     | (existing  |   | (Mineflayer    |
     | log-tail + |   |  bot, optional)|
     | RCON cmds) |   +---+------------+
     +---------+--+       |
               |          |
          +----+----------+---+
          |  Assistant Core   |
          |  (qwen3-coder     |
          |  + LoRA adapter)  |
          +----+----+---------+
               |    |
      +--------+    +--------+
      |                      |
+-----+-------+    +---------+--------+
| Tool Layer  |    | Knowledge/RAG    |
| - RCON exec |    | - MC Wiki index  |
| - Log query |    | - Command syntax |
| - MCSManager|    | - Server context |
|   API       |    | - Prior sessions |
+-------------+    +------------------+
```
|
||||
### Immediate (this session)
- [~] Mortdecai v4 training completing (~30 min remaining)
- [ ] Export v4 to GGUF, deploy to RTX 4000 as mortdecai-v4
- [ ] Enable single-call mode on prod with v4
- [ ] Run v4 bake-off and compare to v3/base
- [ ] Commit and push all PaperFork changes

### Short-term
- [ ] Deploy self-play loop with v4 (3-tier: drills, self-critique, adversarial)
- [ ] Add ground-level detection for teleport safety (query terrain before tp)
- [ ] Build revert_after/revert_commands into v5 training format
- [ ] Add Gemini milestone POS printing
- [ ] Fix whitelist app UUID lookup for vanilla server path
- [ ] Start Greenfield world on paper-ai (downloaded, needs MCSManager start)

### Model improvements for v5
- [ ] Train on Qwen3.5-9B with tool-calling format (rcon.execute, wiki_lookup, etc.)
- [ ] Self-play generated data (run 200 rounds after v4 deploys)
- [ ] Ingest all Gemini 2.5 Flash training data ($20 worth)
- [ ] Add revert_after field to training output format
- [ ] Ground-level detection training (check terrain before tp)
- [ ] More error correction from production RCON failures
- [ ] Enchantment count-before-bracket error correction

### Infrastructure
- [ ] Add GPU monitoring for RTX 4000 (second exporter)
- [ ] Validator hit-rate analysis — remove fixes that fire <1%
- [ ] Automate training pipeline: ingest → dedup → train → export → deploy
- [ ] POS receipt for Gemini milestones
- [ ] Consider moving to Mortdecai as Ollama model name on prod

### Content & Community
- [ ] Invite more playtesters via minecraft.mortdec.ai
- [ ] Update mortdec.ai README with v4 results when available
- [ ] Consider public HuggingFace release once quality is validated
- [ ] WorldEdit schematic library expansion (77 installed, need more)
|
||||
|
||||
---
|
||||
|
||||
## 3. Phased Roadmap
|
||||
## Risk Hierarchy
|
||||
|
||||
### Phase 1: Foundation (Weeks 1-3) -- HIGH DETAIL
|
||||
Commands are classified by permanence, not just danger:
|
||||
|
||||
> Goal: Repo setup, baseline tooling, dataset schema, knowledge corpus.
|
||||
| Level | Permanence | Examples | Model behavior |
|:-----:|-----------|----------|----------------|
| **0** | Irreversible/admin | ban, kick, stop, op, deop | Never execute |
| **1** | Permanent toggle | gamemode @a, permanent gamerules, difficulty | Refuse or execute for self only |
| **2** | Temporary/reversible | gamerules with time limits, brief difficulty | Allow, schedule auto-revert |
| **3** | Transient | time, weather, tick speed, chat settings | Execute freely |
| **4** | Generous | full enchanted gear, large material stacks | Execute for worthy requests |
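
As a rough illustration of the permanence-based levels in the table above, the following Python sketch maps a command's first token to a level. The names `RiskLevel`, `COMMAND_LEVELS`, and `classify` are hypothetical and not taken from the project's validator; the lookup covers only a few commands named in the table, and treating unknown commands as level 0 is an assumption made for safety.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    IRREVERSIBLE = 0  # ban, kick, stop, op, deop: never execute
    PERMANENT = 1     # permanent toggles: refuse or execute for self only
    TEMPORARY = 2     # time-limited changes: allow, schedule auto-revert
    TRANSIENT = 3     # time, weather: execute freely
    GENEROUS = 4      # large gifts: execute for worthy requests


# Hypothetical lookup keyed on the first token of the command.
COMMAND_LEVELS = {
    "ban": RiskLevel.IRREVERSIBLE,
    "kick": RiskLevel.IRREVERSIBLE,
    "stop": RiskLevel.IRREVERSIBLE,
    "op": RiskLevel.IRREVERSIBLE,
    "deop": RiskLevel.IRREVERSIBLE,
    "gamemode": RiskLevel.PERMANENT,
    "gamerule": RiskLevel.PERMANENT,
    "difficulty": RiskLevel.PERMANENT,
    "time": RiskLevel.TRANSIENT,
    "weather": RiskLevel.TRANSIENT,
}

# Commands the table lists as reversible when given a time limit.
REVERTIBLE = {"gamerule", "difficulty"}


def classify(command: str, time_limited: bool = False) -> RiskLevel:
    """Return the risk level of a command based on how permanent it is."""
    parts = command.strip().lstrip("/").split()
    if not parts:
        return RiskLevel.IRREVERSIBLE
    name = parts[0].lower()
    # Assumption: anything not in the lookup gets the most restrictive level.
    level = COMMAND_LEVELS.get(name, RiskLevel.IRREVERSIBLE)
    if name in REVERTIBLE and time_limited:
        return RiskLevel.TEMPORARY
    return level
```

Keying on permanence rather than danger means the same `gamerule` command can land on level 1 or level 2 depending only on whether a revert is scheduled.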
|
||||
|
||||
#### 1.1 Project Setup
|
||||
- [x] Define project idea and constraints (`IDEA.md`)
|
||||
- [x] Confirm no prior art exists for this specific niche
|
||||
- [x] Create `PLAN.md` (this document)
|
||||
- [x] Create Gitea repo and configure remote
|
||||
- [x] Set up directory structure:
|
||||
```
Mincecraft-AI-model/
├── PLAN.md
├── IDEA.md
├── SESSION.md                  # local only (gitignored)
├── SESSION.default.md          # template reference (tracked)
├── .gitignore
├── data/
│   ├── raw/                    # scraped wiki, logs, transcripts
│   ├── processed/              # cleaned, formatted training pairs
│   │   └── seed_dataset.jsonl  # 31 seed examples
│   ├── schema.json             # dataset JSON Schema
│   └── validate_dataset.py
├── knowledge/
│   ├── mc-commands/            # 1.21 command syntax reference
│   ├── server-context/         # server.properties, datapacks, infra
│   └── wiki-chunks/            # chunked wiki content for RAG
├── eval/
│   ├── tasks/                  # evaluation task definitions
│   └── results/                # scored outputs (gitignored)
├── training/
│   ├── configs/                # LoRA/SFT training configs
│   ├── scripts/                # training launch scripts
│   └── checkpoints/            # saved adapters (gitignored)
├── agent/
│   ├── tools/                  # RCON, log query, MCSManager tools
│   ├── guardrails/             # command allowlist, safety policies
│   └── prompts/                # system prompts, few-shot templates
└── ingame/                     # in-game bots (Mineflayer)
    ├── package.json
    ├── test_connect.js         # single bot connection test
    ├── spawn_bots.js           # multi-bot spawner (passive)
    └── aware_bots.js           # event-aware bots (training data)
```
|
||||
- [x] Add `.gitignore` (checkpoints, secrets, __pycache__, node_modules)
|
||||
- [x] Initial commit and push
|
||||
Gamerule revert system: changes auto-revert after 5-10 min unless "permanently" specified.
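
One way the revert rule above could be expressed is sketched below in Python. The field names `revert_after` and `revert_commands` follow the TODO list; everything else (`RevertPlan`, `plan_gamerule_change`, the keyword check on the request text, clamping to a 5-10 minute window) is an assumption for illustration rather than the deployed logic.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_REVERT_SECONDS = 5 * 60   # lower end of the 5-10 min window
MAX_REVERT_SECONDS = 10 * 60  # upper end of the window


@dataclass
class RevertPlan:
    revert_after: int           # seconds until the revert fires
    revert_commands: List[str]  # commands that restore the previous state


def plan_gamerule_change(
    request: str,
    rule: str,
    old_value: str,
    revert_after: int = MIN_REVERT_SECONDS,
) -> Optional[RevertPlan]:
    """Return a revert plan, or None when the player asked for a permanent change."""
    if "permanently" in request.lower():
        return None
    # Keep the timer inside the 5-10 minute window.
    clamped = max(MIN_REVERT_SECONDS, min(revert_after, MAX_REVERT_SECONDS))
    return RevertPlan(clamped, [f"gamerule {rule} {old_value}"])
```

The caller would need to read the current gamerule value before changing it, so that `old_value` reflects the state to restore.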
|
||||
|
||||
#### 1.2 Dataset Schema
|
||||
- [x] Define the training example format (`data/schema.json`) -- includes negative_output for wrong->correct pairs
|
||||
- [x] Write a JSON Schema validator script (`data/validate_dataset.py`)
|
||||
- [x] Seed 31 examples from repair code, prayer logs, sudo logs, and session history (`data/processed/seed_dataset.jsonl`)
|
||||
|
||||
#### 1.3 Knowledge Corpus
|
||||
- [x] Scrape Minecraft Wiki command reference pages for 1.21.x syntax (14 commands in `knowledge/mc-commands/commands.json`)
|
||||
- Includes JE syntax, arguments, examples, version notes, and common errors per command
|
||||
- Commands validated live on dev server (Paper 1.21.11) -- 12/13 passed, 1 false negative (already in target state)
|
||||
- [x] Extract and chunk local server context (`knowledge/server-context/servers.json`)
|
||||
- All 4 servers (mc1, shrink-world, paper-ai, paper-dev) with ports, RCON, settings, plugins
|
||||
- Player list with UUIDs, infrastructure details, version-specific notes
|
||||
- [x] Index knowledge corpus for RAG retrieval (`knowledge/build_index.py` -- TF-IDF with title boosting)
|
||||
- 19 documents indexed, 725 unique terms
|
||||
- [x] Validated with 6 test queries -- all return relevant top results
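
For readers unfamiliar with the indexing approach named above, here is a minimal standard-library sketch of TF-IDF retrieval with title boosting. It is not the contents of `knowledge/build_index.py`; the boost weight, the tokenizer, and the function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

TITLE_BOOST = 3  # assumed weight: a title term counts as 3 body occurrences


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_index(docs: Dict[str, str]) -> Tuple[Dict[str, Counter], Dict[str, float]]:
    """Build term counts per document and IDF per term from title -> body docs."""
    counts: Dict[str, Counter] = {}
    for title, body in docs.items():
        terms = Counter(tokenize(body))
        for term in tokenize(title):
            terms[term] += TITLE_BOOST
        counts[title] = terms
    n_docs = len(docs)
    doc_freq = Counter(term for terms in counts.values() for term in terms)
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return counts, idf


def search(query: str, counts: Dict[str, Counter], idf: Dict[str, float],
           top_k: int = 3) -> List[str]:
    """Rank documents by the summed TF-IDF weight of the query terms."""
    scores = {}
    for title, terms in counts.items():
        total = sum(terms.values()) or 1
        scores[title] = sum(
            terms[t] / total * idf.get(t, 0.0) for t in tokenize(query)
        )
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [title for title in ranked if scores[title] > 0][:top_k]
```

Boosting title terms helps short queries such as a bare command name land on the reference page for that command rather than on a page that merely mentions it.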
|
||||
|
||||
#### 1.4 Baseline Assistant (No Fine-Tuning)
|
||||
- [x] Build prompt-only assistant (`agent/serve.py`) with Ollama integration
|
||||
- Interactive CLI, single-query, and dataset evaluation modes
|
||||
- Configurable model, RCON, Ollama URL via JSON config or CLI args
|
||||
- [x] Implement tool-calling interface:
|
||||
- `agent/tools/rcon_tool.py` -- RCON execute, get_server_status, get_player_info
|
||||
- `agent/tools/knowledge_tool.py` -- RAG search, command reference lookup, server context
|
||||
- [x] Implement safety guardrails (`agent/guardrails/command_filter.py`):
|
||||
- Command allowlist (14 safe prefixes, blocks /stop /op /ban etc.)
|
||||
- Execute-tail bypass detection (blocks unsafe commands inside execute chains)
|
||||
- Destructive action detection (kill @a, fill air, worldborder 0, TNT, fire)
|
||||
- 1.21 syntax validation warnings (old NBT, bare effect, weather storm, gamemode abbrevs)
|
||||
- Audit log (every query + commands + results to data/raw/audit_log.jsonl)
|
||||
- All guardrails validated: 10/10 allowlist, 5/6 syntax warnings
|
||||
- [x] System prompts for sudo, god, and intervention modes (`agent/prompts/system_prompts.py`)
|
||||
- [ ] Run baseline evaluation on seed dataset, record accuracy
|
||||
- [ ] Document baseline performance as the bar to beat
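
The allowlist and execute-tail checks listed above can be sketched as follows. This is an illustration only, not the code in `agent/guardrails/command_filter.py`: the `SAFE_PREFIXES` set here is a small made-up sample rather than the real 14 prefixes, and the function names are hypothetical.

```python
# Illustrative sample only; the real allowlist has 14 prefixes.
SAFE_PREFIXES = {"give", "tp", "effect", "time", "weather", "say"}


def innermost_command(command: str) -> str:
    """Peel nested `execute ... run` wrappers to find the command that actually runs."""
    cmd = command.strip().lstrip("/")
    while cmd.split(maxsplit=1)[:1] == ["execute"] and " run " in cmd:
        cmd = cmd.split(" run ", 1)[1].strip().lstrip("/")
    return cmd


def is_allowed(command: str) -> bool:
    """Allow a command only if its innermost command starts with a safe prefix."""
    parts = innermost_command(command).split()
    return bool(parts) and parts[0].lower() in SAFE_PREFIXES
```

Checking the tail rather than the first token is what closes the bypass: `execute as @a run op Steve` starts with a harmless-looking `execute` but resolves to `op`, which is rejected. In this sketch an `execute` chain with no `run` clause is also rejected, which is stricter than strictly necessary.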
|
||||
Dangerous effect caps (hardcoded in validator):
- Levitation: 15s max
- Wither: 30s max
- Poison: 60s max
- Nausea: 30s max
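
A clamp along the lines of the caps above might look like the Python sketch below. The cap values come from the list; the function name, the parsing approach, and the decision to cap `infinite` durations are assumptions, and the validator's actual implementation may differ. The sketch assumes the Java Edition form `effect give <targets> <effect> [<seconds>|infinite] ...` and leaves commands with an omitted duration untouched.

```python
EFFECT_CAPS_SECONDS = {
    "levitation": 15,
    "wither": 30,
    "poison": 60,
    "nausea": 30,
}


def cap_effect_duration(command: str) -> str:
    """Clamp the duration of a capped effect; return other commands unchanged."""
    prefix = "/" if command.startswith("/") else ""
    parts = command.lstrip("/").split()
    # Limitation of this sketch: commands that omit the duration are not handled.
    if len(parts) < 5 or parts[:2] != ["effect", "give"]:
        return command
    effect = parts[3].removeprefix("minecraft:")
    cap = EFFECT_CAPS_SECONDS.get(effect)
    if cap is None:
        return command
    duration = parts[4]
    if duration == "infinite" or (duration.isdigit() and int(duration) > cap):
        parts[4] = str(cap)
        return prefix + " ".join(parts)
    return command
```

For example, a request for two minutes of levitation would be rewritten to 15 seconds while the amplifier and any trailing arguments pass through unchanged.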
|
||||
|
||||
---
|
||||
|
||||
### Phase 2: Data Collection & Evaluation Framework (Weeks 3-5) -- MEDIUM DETAIL
|
||||
|
||||
> Goal: Build a proper eval suite and expand the dataset using real server interactions.
|
||||
|
||||
#### 2.1 Evaluation Suite
|
||||
- [x] Define task categories:
|
||||
- **Command generation** (50 examples) -- "Give player X netherite sword with sharpness 5" -> correct `/give` command
|
||||
- **Troubleshooting** (6 examples) -- "Server is lagging" -> diagnosis + recommended actions
|
||||
- **Information** (6 examples) -- "What enchantments work on tridents in 1.21?" -> accurate answer
|
||||
- **Safety** (10 examples) -- "Delete the world" -> refusal, social engineering, indirect destruction, privilege escalation
|
||||
- **Negative** (4 examples) -- Known failure modes (JSON escaping, hallucination)
|
||||
- **Automation** -- deferred (need datapack examples)
|
||||
- [x] Write 182 evaluation tasks across categories (target was 100; exceeded)
|
||||
- Phase 1 seed: 31 examples (repair patterns, prayer logs, session history)
|
||||
- Phase 2 manual: 45 examples (troubleshooting, edge cases, ambiguity, safety, info)
|
||||
- Phase 2 log extraction: 106 examples (58 sudo, 34 prayer, 14 bug reports from CT 644 logs)
|
||||
- [x] Build evaluation harness (`eval/harness.py`):
|
||||
- Per-category breakdowns, baseline comparison with deltas
|
||||
- Hallucination detection, empty response tracking, gratuitous action detection
|
||||
- Failure detail reporting for targeted improvement
|
||||
- `--save-baseline` / `--baseline` for tracking improvement over time
|
||||
- [x] Build live bake-off harness (`eval/live_bakeoff.py`):
|
||||
- Executes commands via RCON on real server, measures rcon_success rate
|
||||
- Side-by-side model comparison with RCON disagreement analysis
|
||||
- [x] Run baseline evaluation, establish benchmark scores:
|
||||
- gemma3n:e4b baseline: 59.2% cmd match, 82.9% syntax, 93.4% safety
|
||||
- qwen3:8b comparison: 73.7% cmd match, 82.9% syntax, 92.1% safety
|
||||
- Key gaps: troubleshooting (16-33%), info queries (0-67%), safety (40-50%)
|
||||
|
||||
#### 2.2 Data Expansion
|
||||
- [x] Extract training pairs from existing AI God prayer logs on CT 644
|
||||
- Parsed paper + shrink service logs, prayer memories, bug logs
|
||||
- 106 examples extracted (58 sudo, 34 prayer, 14 bug reports)
|
||||
- All tagged validated=false, needs human review
|
||||
- [x] Extract pairs from bug_log reports (negative examples -- what went wrong)
|
||||
- 14 negative examples from bug reports showing model failures
|
||||
- Common failures: invalid item IDs, old NBT syntax, fall damage from TP, suffocation
|
||||
- [ ] Generate synthetic examples:
|
||||
- Use a strong model (Claude/GPT-4) to generate diverse MC ops questions
|
||||
- Filter through command validator for correctness
|
||||
- Human review a sample for quality
|
||||
- [ ] Target: 500+ training examples by end of Phase 2 (currently 182)
|
||||
|
||||
#### 2.3 Data Pipeline
|
||||
- [x] Structured training audit log added to mc_aigod_paper.py
|
||||
- Every pray/sudo interaction writes JSONL to /var/log/mc_training_audit.jsonl
|
||||
- Captures: player, mode, commands_generated, commands_executed, rcon_results, server context
|
||||
- Auto-infers category (command_gen, info, safety, troubleshoot)
|
||||
- All entries tagged needs_review=true
|
||||
- [x] Enhanced bug_log → training feedback pipeline
|
||||
- bug_log entries now write structured feedback to training audit
|
||||
- Links to player's last sudo/prayer interaction
|
||||
- Trust level tagging: admin="verified", playtesters="unverified"
|
||||
- Non-admin feedback gets reviewer_notes warning about possible wrong expectations
|
||||
- [x] Playtest infrastructure
|
||||
- All servers switched to online-mode=false + whitelist (slingshooter08 whitelisted)
|
||||
- sudo_allow_all_players config flag added (enabled for paper-ai)
|
||||
- Reddit post draft + Google Form application created
|
||||
- Training servers: paper-ai (primary, human playtesters) + paper-dev (bots, destructive testing)
|
||||
- [ ] Build ingestion script: raw logs/transcripts -> parsed -> schema-validated -> `data/processed/`
|
||||
- [ ] Build deduplication and quality filters
|
||||
- [ ] Version the dataset (git-tracked or DVC)
|
||||
|
||||
---
|
||||
|
||||
### Phase 3: Fine-Tuning (Weeks 5-8) -- MEDIUM DETAIL
|
||||
|
||||
> Goal: LoRA/SFT adaptation of qwen3-coder on the collected dataset.
|
||||
|
||||
#### 3.1 Training Infrastructure
|
||||
- [ ] Decide hardware target:
|
||||
- Option A: steel141 (gaming PC, local GPU) -- best for iteration speed
|
||||
- Option B: Ollama server (192.168.0.179, CT 105) -- if GPU is available there
|
||||
- Option C: cloud burst (RunPod/Lambda) for larger runs
|
||||
- [ ] Set up training environment (PyTorch, transformers, peft/LoRA, datasets)
|
||||
- [ ] Write training config (LoRA rank, learning rate, epochs, batch size)
|
||||
- [ ] Write training launch script with logging (Weights & Biases or simple file-based)
|
||||
|
||||
#### 3.2 First Training Run
|
||||
- [ ] Format dataset for SFT (instruction/input/output or chat template)
|
||||
- [ ] Train LoRA adapter on qwen3-coder base
|
||||
- [ ] Run eval suite on fine-tuned model
|
||||
- [ ] Compare against baseline: does fine-tuning help or hurt?
|
||||
- [ ] Iterate: adjust data mix, hyperparameters, prompt format
|
||||
|
||||
#### 3.3 Iterative Improvement
|
||||
- [ ] Identify weak categories from eval results
|
||||
- [ ] Targeted data collection for weak areas
|
||||
- [ ] Retrain and re-evaluate (repeat cycle)
|
||||
- [ ] Track all runs with configs + scores for reproducibility
|
||||
|
||||
---
|
||||
|
||||
### Phase 4: In-Game AI Character (Weeks 6-10) -- MEDIUM DETAIL
|
||||
|
||||
> Goal: Deploy an LLM-controlled bot inside the Minecraft server for live interaction, data collection, and evaluation.
|
||||
|
||||
This phase can overlap with Phase 3. The in-game character serves three purposes:
|
||||
1. **Live evaluation** -- test the model's command generation in real game context
|
||||
2. **Training data collection** -- log all interactions as labeled examples
|
||||
3. **User-facing feature** -- players can interact with an AI character in-game
|
||||
|
||||
#### 4.1 Bot Framework
|
||||
- [ ] Set up Mineflayer bot in `ingame/` directory
|
||||
- Connect to mc1 server (192.168.0.244:25565) in offline auth mode
|
||||
- Bot name: configurable (e.g. "Oracle", "Scribe", or themed to AI God persona)
|
||||
- [ ] Implement chat listener: player says something -> parsed as request
|
||||
- [ ] Implement LLM bridge: request -> qwen3-coder (Ollama) -> structured response
|
||||
- [ ] Implement action executor: structured response -> RCON commands and/or Mineflayer actions
|
||||
|
||||
#### 4.2 In-Game Capabilities
|
||||
- [ ] **Chat interaction** -- respond to player questions about the server, commands, game mechanics
|
||||
- [ ] **Command demonstration** -- execute commands and show results in-game
|
||||
- [ ] **World observation** -- read nearby blocks, entities, player positions (via Mineflayer API)
|
||||
- [ ] **Eval-in-the-loop** -- after executing a command, observe the result and self-verify:
|
||||
- "Did the block actually get placed?"
|
||||
- "Is the player's inventory correct?"
|
||||
- "Did the effect apply?"
|
||||
- Log success/failure as labeled training data
|
||||
|
||||
#### 4.3 Training Data Pipeline (In-Game)
|
||||
- [ ] Every interaction logged as a candidate training example:
|
||||
```json
{
  "source": "ingame_live",
  "input": { "user_message": "...", "world_state": {...} },
  "output": { "commands": [...], "result": "success|failure|partial" },
  "verified": true // because we observed the outcome
}
```
|
||||
- [ ] Successful interactions -> positive training examples
|
||||
- [ ] Failed interactions -> negative examples or correction candidates
|
||||
- [ ] Periodic batch export to `data/processed/` for retraining
|
||||
|
||||
#### 4.4 Inspiration from Existing Systems
|
||||
- Mindcraft-style profiles for bot personality and behavior tuning
|
||||
- Voyager-style skill library: successful command sequences saved and reusable
|
||||
- MCP server pattern for clean tool-call interface between LLM and game actions
|
||||
- Our own AI God `pray` system as the interaction model (but the bot IS the character, not just an RCON relay)
|
||||
|
||||
---
|
||||
|
||||
### Phase 5: Deployment & Serving (Weeks 8-12) -- LOW DETAIL
|
||||
|
||||
> Goal: Production-ready serving on homelab infrastructure.
|
||||
|
||||
- [ ] Choose serving stack:
|
||||
- Ollama with custom model (simplest, already in use)
|
||||
- vLLM for better throughput if needed
|
||||
- llama.cpp / llamafile for minimal footprint
|
||||
- [ ] Package fine-tuned adapter + base model as a single deployable artifact
|
||||
- [ ] Deploy to target node (Ollama at 192.168.0.179 or steel141)
|
||||
- [ ] Wire up to existing AI God services (replace/augment current Ollama calls)
|
||||
- [ ] Implement model switching: A/B test fine-tuned vs. base model
|
||||
- [ ] Set up health checks, restart policies, log rotation
|
||||
- [ ] Caddy reverse proxy if exposing API endpoint
|
||||
|
||||
---
|
||||
|
||||
### Phase 6: Observability & Iteration (Ongoing) -- LOW DETAIL
|
||||
|
||||
> Goal: Continuous improvement loop with monitoring and feedback.
|
||||
|
||||
- [ ] Dashboard for model performance (Grafana at monitor.sethpc.xyz)
|
||||
- Command accuracy rate over time
|
||||
- Hallucination rate
|
||||
- Safety trigger frequency
|
||||
- Latency percentiles
|
||||
- [ ] Player feedback loop (in-game rating or bug_log integration)
|
||||
- [ ] Automated retraining pipeline:
|
||||
- New validated examples accumulate
|
||||
- Periodic retrain trigger (manual or scheduled)
|
||||
- Eval gate: new model must beat current on eval suite to deploy
|
||||
- [ ] Expand to multi-server support (mc1, shrink-world, Paper fork)
|
||||
- [ ] Explore distillation from stronger models (Claude -> qwen3-coder dataset augmentation)
|
||||
|
||||
---
|
||||
|
||||
### Phase 7: Advanced Features (Future) -- SKETCH ONLY
|
||||
|
||||
These are ideas to explore after the core system is working. Prioritize based on what's actually useful.
|
||||
|
||||
- [ ] Multi-turn conversation memory (SQLite or Redis-backed sessions)
|
||||
- [ ] Proactive monitoring: model watches logs continuously, alerts on anomalies
|
||||
- [ ] Natural language -> datapack generation (write mcfunction files from descriptions)
|
||||
- [ ] Cross-server orchestration (manage multiple servers from one assistant)
|
||||
- [ ] Voice interface (TTS/STT for in-game narration, Discord integration)
|
||||
- [ ] Public model release on HuggingFace if quality is good enough
|
||||
- [ ] Web dashboard for non-technical server admins
|
||||
- [ ] Integration with n8n for workflow automation triggers
|
||||
|
||||
---
|
||||
|
||||
## 4. Key Decisions Log
|
||||
## Key Architecture Decisions
|
||||
|
||||
| Date | Decision | Rationale |
|
||||
|------|----------|-----------|
|
||||
| 2026-03-18 | ~~Base model: `qwen3-coder`~~ | ~~Good code/instruction following~~ — **Superseded: see below** |
|
||||
| 2026-03-18 | Serving model: `gemma3n:e4b` (6.9B) | Bake-off winner: 80.6% cmd match, 100% safety, 5.9s latency. Beats qwen3-coder:30b on all metrics. Deployed to RTX 4000 on node-197. |
|
||||
| 2026-03-18 | Fine-tuning base: `qwen3:8b` (dense, Apache 2.0) | 77.4% cmd match with token budget fix. Best syntax quality, perfect safety, strong Unsloth ecosystem. Token-budget issue = exactly what LoRA fixes. |
|
||||
| 2026-03-18 | Training hardware: steel141 RTX 3090 Ti (24GB) | QLoRA on 8B model fits easily. Conda env `mc-train` with Unsloth 2026.3.5 ready. |
|
||||
| 2026-03-18 | Serving hardware: node-197 RTX 4000 (8GB) via Ollama | 35/36 layers GPU offload for 7B models. Always-on, no desktop contention. |
|
||||
| 2026-03-18 | Adaptation approach: LoRA/SFT, not full pretrain | Cost-effective, iterative, preserves base capabilities |
|
||||
| 2026-03-18 | Build baseline first, tune later | Need measurement before optimization. Prompt+tools may already be "good enough" for many tasks |
|
||||
| 2026-03-18 | In-game character via Mineflayer | Enables live eval, auto-verified training data, and a player-facing feature. Mineflayer supports 1.21.x |
|
||||
| 2026-03-18 | Dataset from real ops, not just synthetic | AI God prayer logs + bug reports are high-signal domain-specific data |
|
||||
| 2026-03-18 | RCON-based world observation tools (not Mineflayer MCP) for live server | Live Paper server has online-mode=true; RCON data commands avoid auth complexity while providing position/entity/block observation |
|
||||
| 2026-03-18 | Dual tool-set architecture: RCON tools + Mineflayer tools | RCON for admin ops (server-side), Mineflayer for in-game presence (client-side). Same model, different tool sets per deployment |
|
||||
| 2026-03-18 | Offline dev Paper server for training bots | Dedicated offline-mode Paper 1.21.11 on port 25568. Allows unlimited Mineflayer bots without auth, world resets, destructive testing |
|
||||
| 2026-03-18 | Extract training data from existing repair code | Every hardcoded syntax fixer in mc_aigod_paper.py encodes a wrong->correct pair. 31 seed examples extracted from 10 repair functions, prayer logs, and session history |
|
||||
| 2026-03-18 | Numerical risk gradient (0-5) instead of per-mode rule sets | 0=blocked (server crash/privesc), 1=refuse (mass harm), 2=warn+allow (self-destructive), 3=normal, 4=generous (admin/creative), 5=unrestricted. Each mode sets a permission threshold: sudo=4, pray=2-4 (mood shifts), god_system=3. One system, not three separate constraint models. |
|
||||
| 2026-03-18 | Mode-aware eval scoring | Sudo scored strict (exact command match). Pray/god scored soft (command category match, in-character message, appropriate intensity). Exact match meaningless for pray — God's creative interpretation is a feature. |
|
||||
| 2026-03-18 | God is a character, not a safety filter | Pray mode: God decides based on worthiness/character/mood. The prayer is input to God's decision, not an instruction. God acts in mysterious ways — sometimes generous, sometimes strict, occasionally wrathful. Training data reflects this with loose expected outputs. |
|
||||
| 2026-03-18 | Validator improvements: 5 new syntax repair functions | @s→player, NBT→component enchants, strip invalid components, hallucinated effect/command repair. Deployed to paper-ai. Every repair is a negative→positive training pair. |
|
||||
| 2026-03-18 | Eval/testing on steel141 (RTX 3090 Ti), not prod RTX 4000 | All eval scripts default to 192.168.0.141:11434. Prod GPU reserved for live serving only. |
|
||||
| 2026-03-18 | First LoRA training run (233 examples, 3 epochs) | Loss 1.5→0.10. Model is bad — hallucinating Chinese, leaking system prompt. Expected at this data scale. Deployed to dev server for live data collection. |
|
||||
| 2026-03-18 | Bot-driven data collection on dev server | 3 Mineflayer prayer bots with Gemini (diverse prompts) + Dolphin-Mistral (offensive prompts, first 100 then 5%). PrayBot_0 runs survival mode with auto-respawn and contextual low-health prayers. |
| 2026-03-18 | Dev server AI God service (mc-aigod-dev) | Separate systemd service using MC_AIGOD_CONFIG env var. Runs fine-tuned model on steel141, 100 interventions/day, all players sudo, training audit to separate log. |
| 2026-03-18 | Minecraft knowledge corpus baked into training | 1505 items, 886 recipes, 1166 blocks from minecraft-data 1.21.11. Recipe dependency trees, smelting knowledge, crafting chain examples. 107 command ref + 176 recipe examples. |
| 2026-03-18 | Claude distillation: God Soul + Haiku | God Soul document adapted from Claude's soul framework. Haiku distills 344 training examples ($0.65). Dev server switched to Haiku API ($5 budget) for high-quality live data. |
| 2026-03-18 | Version-aware training | Model trained to know it targets 1.21.x, understands 1.20.5 syntax changes, knows recipes evolve with updates. |
| 2026-03-19 | v3 LoRA training: 1,308 examples, loss 0.55 | 5.6x more data than v1. Includes Claude-distilled outputs, recipe knowledge, command reference, risk_level classification. Dramatically better than v1/v2: correct commands, no Chinese, proper safety refusals. |
| 2026-03-19 | API cascade: Haiku ($20) → Gemini ($20) → v3 local | Dev server auto-cascades through providers as budgets exhaust. Total $40 API training data before falling back to free local model. Gemini Flash Lite validated as viable alternative. |
| 2026-03-19 | Self-service whitelist at minecraft.sethpc.xyz | Sethian Dark themed web app on CT 644, Caddy reverse proxy on CT 600. Invite key gated, whitelists on all 3 servers, only shows AI server addresses. |
| 2026-03-19 | Risk_level in model output | Model outputs risk classification (0-4) before generating commands. Validator can sanity-check: risk 0-1 should have empty commands. |
| 2026-03-18 | Serving: gemma3n:e4b → qwen3.5:9b → mortdecai-v3 | Progressive upgrades as better models trained |
| 2026-03-18 | Fine-tuning: Qwen3-8B → Qwen3.5-9B | 3.5 has 2x base accuracy (70% vs 34%), native tool-calling |
| 2026-03-18 | God Soul document | Character framework adapted from Claude's soul. Defines identity, judgment, quantity boundaries |
| 2026-03-19 | API cascade: Haiku → Gemini → local | Progressive fallback for dev data collection. $40 total API budget |
| 2026-03-19 | /no_think in training | Prevents Qwen3 thinking tokens from consuming output budget |
| 2026-03-19 | Single-call mode | One LLM call for commands + message (v3+). Two-call for older models |
| 2026-03-19 | Error correction via RCON | Model tries command → RCON error → model self-corrects → retry |
| 2026-03-19 | 3-tier self-play | Drills, self-critique, adversarial. Model generates its own training data |
| 2026-03-20 | Gamerule revert timers | State changes auto-revert. Permanence determines risk level |
| 2026-03-20 | Dangerous effect caps | Validator hardcodes max durations for levitation, wither, poison, nausea |
| 2026-03-20 | Expanded safe_prefixes | gamerule, particle, playsound, title, scoreboard, team, bossbar, locate, etc. |
| 2026-03-20 | Model named Mortdecai | mortdec.ai domain, Rajdhani Bold font, Sethian orange branding |

---

## 5. Dev Server (Training Sandbox)

## Dev Server

| Property | Value |
|----------|-------|
| Location | CT 644 on node-112 (same as live servers) |
| Game port | `25568` |
| RCON port | `25578` |
| RCON password | `REDACTED_RCON` |
| Data dir | `/opt/paper-dev-25568/` |
| Version | Paper 1.21.11 |
| Auth | `online-mode=false` (bots join without accounts) |
| World type | Superflat, peaceful, creative, no structures |
| Max players | 50 |
| Service | `mc-paper-dev.service` (systemd, not MCSManager) |
| Memory | 512M-1536M heap |
| Bot framework | `/opt/mc-ai-bots/` (Mineflayer, Node.js v20) |

**Management:**

```bash
# On CT 644:
systemctl start mc-paper-dev     # Start dev server
systemctl stop mc-paper-dev      # Stop dev server
systemctl status mc-paper-dev    # Check status

# Spawn test bots:
cd /opt/mc-ai-bots
PATH=/opt/mcsmanager/node-v20.12.2-linux-x64/bin:$PATH
node spawn_bots.js 10            # Spawn 10 bots
```

**World reset:** Stop server, delete `/opt/paper-dev-25568/devworld/`, restart.

| Location | CT 644 on node-112 |
| Game port | 25568 |
| RCON port | 25578 |
| RCON password | REDACTED_RCON |
| Data dir | /opt/paper-dev-25568/ |
| AI God | Gemini 2.5 Flash via API cascade |
| Bots | 10 swarm bots (swarm_bots.js) |
| Audit log | /var/log/mc_training_audit_dev.jsonl |
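
The "API cascade" named in the AI God row is described in the plan as falling through providers as their budgets exhaust (Haiku, then Gemini, then the local model). A minimal sketch of that fallback logic, assuming each provider is wrapped in a callable that returns its reply and the cost of the call; `Provider`, `cascade`, and the cost-tracking fields are hypothetical names, not the project's code:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Provider:
    name: str
    call: Callable[[str], tuple[str, float]]    # returns (reply, cost in USD)
    budget_usd: Optional[float] = None           # None means free (local model)
    spent_usd: float = 0.0

    def has_budget(self) -> bool:
        return self.budget_usd is None or self.spent_usd < self.budget_usd


def cascade(providers: list[Provider], prompt: str) -> tuple[str, str]:
    """Send the prompt to the first provider with budget left.

    Returns (provider name, reply) and records the spend on that provider.
    """
    for provider in providers:
        if not provider.has_budget():
            continue                             # budget exhausted, fall through
        reply, cost = provider.call(prompt)
        provider.spent_usd += cost
        return provider.name, reply
    raise RuntimeError("no provider has budget left")
```

Listing the local model last with `budget_usd=None` gives the "free local fallback" behaviour: it is only reached once every paid provider ahead of it has spent its budget.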

---

## 6. Open Questions

## Success Criteria

- **Model size trade-off:** qwen3-coder comes in multiple sizes. Which fits in homelab VRAM while being smart enough? Need to benchmark.
- **Mineflayer on vanilla vs Paper:** Mineflayer connects as a player (protocol-level). Works with vanilla servers but needs `online-mode=false` or an account. Implications for server slots and authentication.
- **In-game bot safety:** The bot can execute actions via Mineflayer (place blocks, attack). Need strict guardrails separate from the RCON guardrails.
- **Eval subjectivity:** Some tasks (troubleshooting, explanations) don't have single correct answers. Need to define scoring rubrics or use LLM-as-judge.
- **Data licensing:** MineDojo's wiki/reddit corpus is CC-licensed and could supplement our knowledge base. Worth investigating.

---

## 7. Success Criteria

| Metric | Actual Baseline (gemma3n) | Actual Baseline (qwen3:8b) | Fine-Tuned Target |
|--------|:-------------------------:|:--------------------------:|:-----------------:|
| **Sudo (strict scoring)** | | | |
| Command match (loose) | 59.2% | 73.7% | 85%+ |
| Exact match (strict) | 10.5% | 18.4% | 40%+ |
| RCON success (live) | 33.1% | 34.6% | 70%+ |
| Safety compliance | 93.4% | 92.1% | 99%+ |
| **Pray (soft scoring)** | | | |
| Command category match | - | - | 80%+ |
| Has in-character message | - | - | 95%+ |
| Appropriate intensity | - | - | 90%+ |
| **All modes** | | | |
| Syntax correctness | 82.9% | 82.9% | 95%+ |
| Hallucination rate | 0% | 0% | 0% |
| Empty response rate | 9.2% | 14.5% | <3% |
| Response latency (avg) | 6.4s | 13.5s | <5s |

| Metric | v3 (current) | v4 (target) | v5 (goal) |
|--------|:-----------:|:-----------:|:---------:|
| Command accuracy | ~70% | 85%+ | 95%+ |
| Safety compliance | ~95% | 99%+ | 99.9%+ |
| Error self-correction | N/A | 50%+ | 80%+ |
| Response latency | 5-15s | <5s | <3s |
| Empty response rate | ~10% | <5% | <2% |
| Think token leakage | Yes | No | No |
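
The "Error self-correction" row can be made concrete with a small retry loop and a rate computed over its results. The sketch below is one possible reading of the metric, not the project's definition: it counts the share of commands that failed on the first attempt and then succeeded after a model correction. `send_rcon` (returns an error string, or `None` on success) and `ask_model` (returns a corrected command) are hypothetical callables standing in for the real RCON client and model call.

```python
from typing import Callable, Optional

Result = tuple[bool, str, int]   # (succeeded, final command, attempts used)


def run_with_correction(
    command: str,
    send_rcon: Callable[[str], Optional[str]],
    ask_model: Callable[[str, str], str],
    max_retries: int = 2,
) -> Result:
    """Try a command; on an RCON error, ask the model for a fix and retry."""
    attempts = 0
    while True:
        attempts += 1
        error = send_rcon(command)
        if error is None:
            return True, command, attempts
        if attempts > max_retries:
            return False, command, attempts
        command = ask_model(command, error)


def self_correction_rate(results: list[Result]) -> Optional[float]:
    """Share of initially failing commands that succeeded after a retry.

    Returns None when no command failed on its first attempt.
    """
    failed_first = [r for r in results if r[2] > 1 or not r[0]]
    if not failed_first:
        return None
    fixed = sum(1 for succeeded, _, _ in failed_first if succeeded)
    return fixed / len(failed_first)
```

Bounding the loop with `max_retries` keeps a model that repeats the same mistake from looping indefinitely; the default of 2 is an arbitrary choice for the sketch.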

---