# PLAN.md -- Project Roadmap (Live Document) > **Last updated:** 2026-03-18 (rev 2) > **Status legend:** `[ ]` planned | `[~]` in progress | `[x]` done | `[-]` cancelled/deferred --- ## 0. Vision Build a lightweight, Minecraft-focused AI assistant by adapting `qwen3-coder` (LoRA/SFT). The assistant operates as an **ops copilot** for Sethpc Minecraft servers -- generating correct commands, troubleshooting logs, automating admin tasks, and optionally acting as an **in-game AI character** for live interaction, training data collection, and evaluation. This is **not** a gameplay agent (like Voyager/MineDojo). It is a **server operations assistant** with an optional embodied presence for testing and data gathering. --- ## 1. Prior Art & Inspirations These projects informed the plan but solve different problems: | Project | What it does | What we borrow | |---------|-------------|----------------| | **Voyager** (6.7k stars) | LLM-powered embodied agent that plays Minecraft via Mineflayer. Skill library + auto-curriculum + iterative prompting. | Skill library concept (reusable verified command sequences). Iterative self-verification loop for command correctness. | | **MineDojo** (2.2k stars) | RL/LLM research framework with 3142 tasks. Internet-scale knowledge base (730K YouTube vids, 7K wiki pages, 340K Reddit posts). | Knowledge corpus pipeline -- scraping wiki.vg and Minecraft Wiki for command syntax reference data. Task-based evaluation structure. | | **Mindcraft** (4.9k stars) | LLM + Mineflayer in-game bots with profiles, multi-agent collab. Supports Ollama, many APIs. | Profile-based bot architecture. In-game chat integration pattern. Ollama local model support. Provides own fine-tuned models (`sweaterdog/andy-4`). | | **minecraft-mcp-server** (514 stars) | MCP (Model Context Protocol) server wrapping Mineflayer. Lets Claude/LLMs control a Minecraft character via tool calls. | MCP tool-call interface for in-game actions. Could be adapted for our eval harness. | | **Mineflayer** (6.7k stars) | Node.js Minecraft bot framework. Supports 1.8-1.21.11. Movement, inventory, chat, block interaction. | Primary framework for in-game AI character. Mature, well-maintained, 1.21 support confirmed. | | **Existing AI God system** (our own) | Log-tail + RCON + Ollama pipeline. `pray` trigger, divine intervention, command validation, syntax repair. Vanilla + Paper fork. | Direct predecessor. Baseline to measure against. Source of real training data (prayer logs, bug reports). | --- ## 2. Architecture Overview ``` +---------------------+ | Minecraft Server | | (CT 644, 1.21.x) | +----+----------+-----+ | | RCON | | Protocol (Mineflayer) | | +---------+--+ +---+------------+ | Ops Layer | | In-Game Agent | | (existing | | (Mineflayer | | log-tail + | | bot, optional)| | RCON cmds) | +---+------------+ +---------+--+ | | | +----+---------+----+ | Assistant Core | | (qwen3-coder | | + LoRA adapter) | +----+----+---------+ | | +--------+ +--------+ | | +-----+------+ +---------+--------+ | Tool Layer | | Knowledge/RAG | | - RCON exec | | - MC Wiki index | | - Log query | | - Command syntax | | - MCSManager| | - Server context | | API | | - Prior sessions | +-------------+ +------------------+ ``` --- ## 3. Phased Roadmap ### Phase 1: Foundation (Weeks 1-3) -- HIGH DETAIL > Goal: Repo setup, baseline tooling, dataset schema, knowledge corpus. #### 1.1 Project Setup - [x] Define project idea and constraints (`IDEA.md`) - [x] Confirm no prior art exists for this specific niche - [x] Create `PLAN.md` (this document) - [x] Create Gitea repo and configure remote - [x] Set up directory structure: ``` Mincecraft-AI-model/ ├── PLAN.md ├── IDEA.md ├── SESSION.md # local only (gitignored) ├── SESSION.default.md # template reference (tracked) ├── .gitignore ├── data/ │ ├── raw/ # scraped wiki, logs, transcripts │ ├── processed/ # cleaned, formatted training pairs │ │ └── seed_dataset.jsonl # 31 seed examples │ ├── schema.json # dataset JSON Schema │ └── validate_dataset.py ├── knowledge/ │ ├── mc-commands/ # 1.21 command syntax reference │ ├── server-context/ # server.properties, datapacks, infra │ └── wiki-chunks/ # chunked wiki content for RAG ├── eval/ │ ├── tasks/ # evaluation task definitions │ └── results/ # scored outputs (gitignored) ├── training/ │ ├── configs/ # LoRA/SFT training configs │ ├── scripts/ # training launch scripts │ └── checkpoints/ # saved adapters (gitignored) ├── agent/ │ ├── tools/ # RCON, log query, MCSManager tools │ ├── guardrails/ # command allowlist, safety policies │ └── prompts/ # system prompts, few-shot templates └── ingame/ # in-game bots (Mineflayer) ├── package.json ├── test_connect.js # single bot connection test ├── spawn_bots.js # multi-bot spawner (passive) └── aware_bots.js # event-aware bots (training data) ``` - [x] Add `.gitignore` (checkpoints, secrets, __pycache__, node_modules) - [x] Initial commit and push #### 1.2 Dataset Schema - [x] Define the training example format (`data/schema.json`) -- includes negative_output for wrong->correct pairs - [x] Write a JSON Schema validator script (`data/validate_dataset.py`) - [x] Seed 31 examples from repair code, prayer logs, sudo logs, and session history (`data/processed/seed_dataset.jsonl`) #### 1.3 Knowledge Corpus - [x] Scrape Minecraft Wiki command reference pages for 1.21.x syntax (14 commands in `knowledge/mc-commands/commands.json`) - Includes JE syntax, arguments, examples, version notes, and common errors per command - Commands validated live on dev server (Paper 1.21.11) -- 12/13 passed, 1 false negative (already in target state) - [x] Extract and chunk local server context (`knowledge/server-context/servers.json`) - All 4 servers (mc1, shrink-world, paper-ai, paper-dev) with ports, RCON, settings, plugins - Player list with UUIDs, infrastructure details, version-specific notes - [x] Index knowledge corpus for RAG retrieval (`knowledge/build_index.py` -- TF-IDF with title boosting) - 19 documents indexed, 725 unique terms - [x] Validated with 6 test queries -- all return relevant top results #### 1.4 Baseline Assistant (No Fine-Tuning) - [x] Build prompt-only assistant (`agent/serve.py`) with Ollama integration - Interactive CLI, single-query, and dataset evaluation modes - Configurable model, RCON, Ollama URL via JSON config or CLI args - [x] Implement tool-calling interface: - `agent/tools/rcon_tool.py` -- RCON execute, get_server_status, get_player_info - `agent/tools/knowledge_tool.py` -- RAG search, command reference lookup, server context - [x] Implement safety guardrails (`agent/guardrails/command_filter.py`): - Command allowlist (14 safe prefixes, blocks /stop /op /ban etc.) - Execute-tail bypass detection (blocks unsafe commands inside execute chains) - Destructive action detection (kill @a, fill air, worldborder 0, TNT, fire) - 1.21 syntax validation warnings (old NBT, bare effect, weather storm, gamemode abbrevs) - Audit log (every query + commands + results to data/raw/audit_log.jsonl) - All guardrails validated: 10/10 allowlist, 5/6 syntax warnings - [x] System prompts for sudo, god, and intervention modes (`agent/prompts/system_prompts.py`) - [ ] Run baseline evaluation on seed dataset, record accuracy - [ ] Document baseline performance as the bar to beat --- ### Phase 2: Data Collection & Evaluation Framework (Weeks 3-5) -- MEDIUM DETAIL > Goal: Build a proper eval suite and expand the dataset using real server interactions. #### 2.1 Evaluation Suite - [x] Define task categories: - **Command generation** (50 examples) -- "Give player X netherite sword with sharpness 5" -> correct `/give` command - **Troubleshooting** (6 examples) -- "Server is lagging" -> diagnosis + recommended actions - **Information** (6 examples) -- "What enchantments work on tridents in 1.21?" -> accurate answer - **Safety** (10 examples) -- "Delete the world" -> refusal, social engineering, indirect destruction, privilege escalation - **Negative** (4 examples) -- Known failure modes (JSON escaping, hallucination) - **Automation** -- deferred (need datapack examples) - [x] Write 182 evaluation tasks across categories (target was 100; exceeded) - Phase 1 seed: 31 examples (repair patterns, prayer logs, session history) - Phase 2 manual: 45 examples (troubleshooting, edge cases, ambiguity, safety, info) - Phase 2 log extraction: 106 examples (58 sudo, 34 prayer, 14 bug reports from CT 644 logs) - [x] Build evaluation harness (`eval/harness.py`): - Per-category breakdowns, baseline comparison with deltas - Hallucination detection, empty response tracking, gratuitous action detection - Failure detail reporting for targeted improvement - `--save-baseline` / `--baseline` for tracking improvement over time - [x] Build live bake-off harness (`eval/live_bakeoff.py`): - Executes commands via RCON on real server, measures rcon_success rate - Side-by-side model comparison with RCON disagreement analysis - [x] Run baseline evaluation, establish benchmark scores: - gemma3n:e4b baseline: 59.2% cmd match, 82.9% syntax, 93.4% safety - qwen3:8b comparison: 73.7% cmd match, 82.9% syntax, 92.1% safety - Key gaps: troubleshooting (16-33%), info queries (0-67%), safety (40-50%) #### 2.2 Data Expansion - [x] Extract training pairs from existing AI God prayer logs on CT 644 - Parsed paper + shrink service logs, prayer memories, bug logs - 106 examples extracted (58 sudo, 34 prayer, 14 bug reports) - All tagged validated=false, needs human review - [x] Extract pairs from bug_log reports (negative examples -- what went wrong) - 14 negative examples from bug reports showing model failures - Common failures: invalid item IDs, old NBT syntax, fall damage from TP, suffocation - [ ] Generate synthetic examples: - Use a strong model (Claude/GPT-4) to generate diverse MC ops questions - Filter through command validator for correctness - Human review a sample for quality - [ ] Target: 500+ training examples by end of Phase 2 (currently 182) #### 2.3 Data Pipeline - [x] Structured training audit log added to mc_aigod_paper.py - Every pray/sudo interaction writes JSONL to /var/log/mc_training_audit.jsonl - Captures: player, mode, commands_generated, commands_executed, rcon_results, server context - Auto-infers category (command_gen, info, safety, troubleshoot) - All entries tagged needs_review=true - [x] Enhanced bug_log → training feedback pipeline - bug_log entries now write structured feedback to training audit - Links to player's last sudo/prayer interaction - Trust level tagging: admin="verified", playtesters="unverified" - Non-admin feedback gets reviewer_notes warning about possible wrong expectations - [x] Playtest infrastructure - All servers switched to online-mode=false + whitelist (slingshooter08 whitelisted) - sudo_allow_all_players config flag added (enabled for paper-ai) - Reddit post draft + Google Form application created - Training servers: paper-ai (primary, human playtesters) + paper-dev (bots, destructive testing) - [ ] Build ingestion script: raw logs/transcripts -> parsed -> schema-validated -> `data/processed/` - [ ] Build deduplication and quality filters - [ ] Version the dataset (git-tracked or DVC) --- ### Phase 3: Fine-Tuning (Weeks 5-8) -- MEDIUM DETAIL > Goal: LoRA/SFT adaptation of qwen3-coder on the collected dataset. #### 3.1 Training Infrastructure - [ ] Decide hardware target: - Option A: steel141 (gaming PC, local GPU) -- best for iteration speed - Option B: Ollama server (192.168.0.179, CT 105) -- if GPU is available there - Option C: cloud burst (RunPod/Lambda) for larger runs - [ ] Set up training environment (PyTorch, transformers, peft/LoRA, datasets) - [ ] Write training config (LoRA rank, learning rate, epochs, batch size) - [ ] Write training launch script with logging (Weights & Biases or simple file-based) #### 3.2 First Training Run - [ ] Format dataset for SFT (instruction/input/output or chat template) - [ ] Train LoRA adapter on qwen3-coder base - [ ] Run eval suite on fine-tuned model - [ ] Compare against baseline: does fine-tuning help or hurt? - [ ] Iterate: adjust data mix, hyperparameters, prompt format #### 3.3 Iterative Improvement - [ ] Identify weak categories from eval results - [ ] Targeted data collection for weak areas - [ ] Retrain and re-evaluate (repeat cycle) - [ ] Track all runs with configs + scores for reproducibility --- ### Phase 4: In-Game AI Character (Weeks 6-10) -- MEDIUM DETAIL > Goal: Deploy an LLM-controlled bot inside the Minecraft server for live interaction, data collection, and evaluation. This phase can overlap with Phase 3. The in-game character serves three purposes: 1. **Live evaluation** -- test the model's command generation in real game context 2. **Training data collection** -- log all interactions as labeled examples 3. **User-facing feature** -- players can interact with an AI character in-game #### 4.1 Bot Framework - [ ] Set up Mineflayer bot in `ingame/` directory - Connect to mc1 server (192.168.0.244:25565) in offline auth mode - Bot name: configurable (e.g. "Oracle", "Scribe", or themed to AI God persona) - [ ] Implement chat listener: player says something -> parsed as request - [ ] Implement LLM bridge: request -> qwen3-coder (Ollama) -> structured response - [ ] Implement action executor: structured response -> RCON commands and/or Mineflayer actions #### 4.2 In-Game Capabilities - [ ] **Chat interaction** -- respond to player questions about the server, commands, game mechanics - [ ] **Command demonstration** -- execute commands and show results in-game - [ ] **World observation** -- read nearby blocks, entities, player positions (via Mineflayer API) - [ ] **Eval-in-the-loop** -- after executing a command, observe the result and self-verify: - "Did the block actually get placed?" - "Is the player's inventory correct?" - "Did the effect apply?" - Log success/failure as labeled training data #### 4.3 Training Data Pipeline (In-Game) - [ ] Every interaction logged as a candidate training example: ```json { "source": "ingame_live", "input": { "user_message": "...", "world_state": {...} }, "output": { "commands": [...], "result": "success|failure|partial" }, "verified": true // because we observed the outcome } ``` - [ ] Successful interactions -> positive training examples - [ ] Failed interactions -> negative examples or correction candidates - [ ] Periodic batch export to `data/processed/` for retraining #### 4.4 Inspiration from Existing Systems - Mindcraft-style profiles for bot personality and behavior tuning - Voyager-style skill library: successful command sequences saved and reusable - MCP server pattern for clean tool-call interface between LLM and game actions - Our own AI God `pray` system as the interaction model (but the bot IS the character, not just an RCON relay) --- ### Phase 5: Deployment & Serving (Weeks 8-12) -- LOW DETAIL > Goal: Production-ready serving on homelab infrastructure. - [ ] Choose serving stack: - Ollama with custom model (simplest, already in use) - vLLM for better throughput if needed - llama.cpp / llamafile for minimal footprint - [ ] Package fine-tuned adapter + base model as a single deployable artifact - [ ] Deploy to target node (Ollama at 192.168.0.179 or steel141) - [ ] Wire up to existing AI God services (replace/augment current Ollama calls) - [ ] Implement model switching: A/B test fine-tuned vs. base model - [ ] Set up health checks, restart policies, log rotation - [ ] Caddy reverse proxy if exposing API endpoint --- ### Phase 6: Observability & Iteration (Ongoing) -- LOW DETAIL > Goal: Continuous improvement loop with monitoring and feedback. - [ ] Dashboard for model performance (Grafana at monitor.sethpc.xyz) - Command accuracy rate over time - Hallucination rate - Safety trigger frequency - Latency percentiles - [ ] Player feedback loop (in-game rating or bug_log integration) - [ ] Automated retraining pipeline: - New validated examples accumulate - Periodic retrain trigger (manual or scheduled) - Eval gate: new model must beat current on eval suite to deploy - [ ] Expand to multi-server support (mc1, shrink-world, Paper fork) - [ ] Explore distillation from stronger models (Claude -> qwen3-coder dataset augmentation) --- ### Phase 7: Advanced Features (Future) -- SKETCH ONLY These are ideas to explore after the core system is working. Prioritize based on what's actually useful. - [ ] Multi-turn conversation memory (SQLite or Redis-backed sessions) - [ ] Proactive monitoring: model watches logs continuously, alerts on anomalies - [ ] Natural language -> datapack generation (write mcfunction files from descriptions) - [ ] Cross-server orchestration (manage multiple servers from one assistant) - [ ] Voice interface (TTS/STT for in-game narration, Discord integration) - [ ] Public model release on HuggingFace if quality is good enough - [ ] Web dashboard for non-technical server admins - [ ] Integration with n8n for workflow automation triggers --- ## 4. Key Decisions Log | Date | Decision | Rationale | |------|----------|-----------| | 2026-03-18 | ~~Base model: `qwen3-coder`~~ | ~~Good code/instruction following~~ — **Superseded: see below** | | 2026-03-18 | Serving model: `gemma3n:e4b` (6.9B) | Bake-off winner: 80.6% cmd match, 100% safety, 5.9s latency. Beats qwen3-coder:30b on all metrics. Deployed to RTX 4000 on node-197. | | 2026-03-18 | Fine-tuning base: `qwen3:8b` (dense, Apache 2.0) | 77.4% cmd match with token budget fix. Best syntax quality, perfect safety, strong Unsloth ecosystem. Token-budget issue = exactly what LoRA fixes. | | 2026-03-18 | Training hardware: steel141 RTX 3090 Ti (24GB) | QLoRA on 8B model fits easily. Conda env `mc-train` with Unsloth 2026.3.5 ready. | | 2026-03-18 | Serving hardware: node-197 RTX 4000 (8GB) via Ollama | 35/36 layers GPU offload for 7B models. Always-on, no desktop contention. | | 2026-03-18 | Adaptation approach: LoRA/SFT, not full pretrain | Cost-effective, iterative, preserves base capabilities | | 2026-03-18 | Build baseline first, tune later | Need measurement before optimization. Prompt+tools may already be "good enough" for many tasks | | 2026-03-18 | In-game character via Mineflayer | Enables live eval, auto-verified training data, and a player-facing feature. Mineflayer supports 1.21.x | | 2026-03-18 | Dataset from real ops, not just synthetic | AI God prayer logs + bug reports are high-signal domain-specific data | | 2026-03-18 | RCON-based world observation tools (not Mineflayer MCP) for live server | Live Paper server has online-mode=true; RCON data commands avoid auth complexity while providing position/entity/block observation | | 2026-03-18 | Dual tool-set architecture: RCON tools + Mineflayer tools | RCON for admin ops (server-side), Mineflayer for in-game presence (client-side). Same model, different tool sets per deployment | | 2026-03-18 | Offline dev Paper server for training bots | Dedicated offline-mode Paper 1.21.11 on port 25568. Allows unlimited Mineflayer bots without auth, world resets, destructive testing | | 2026-03-18 | Extract training data from existing repair code | Every hardcoded syntax fixer in mc_aigod_paper.py encodes a wrong->correct pair. 31 seed examples extracted from 10 repair functions, prayer logs, and session history | | 2026-03-18 | Numerical risk gradient (0-5) instead of per-mode rule sets | 0=blocked (server crash/privesc), 1=refuse (mass harm), 2=warn+allow (self-destructive), 3=normal, 4=generous (admin/creative), 5=unrestricted. Each mode sets a permission threshold: sudo=4, pray=2-4 (mood shifts), god_system=3. One system, not three separate constraint models. | | 2026-03-18 | Mode-aware eval scoring | Sudo scored strict (exact command match). Pray/god scored soft (command category match, in-character message, appropriate intensity). Exact match meaningless for pray — God's creative interpretation is a feature. | | 2026-03-18 | God is a character, not a safety filter | Pray mode: God decides based on worthiness/character/mood. The prayer is input to God's decision, not an instruction. God acts in mysterious ways — sometimes generous, sometimes strict, occasionally wrathful. Training data reflects this with loose expected outputs. | | 2026-03-18 | Validator improvements: 5 new syntax repair functions | @s→player, NBT→component enchants, strip invalid components, hallucinated effect/command repair. Deployed to paper-ai. Every repair is a negative→positive training pair. | | 2026-03-18 | Eval/testing on steel141 (RTX 3090 Ti), not prod RTX 4000 | All eval scripts default to 192.168.0.141:11434. Prod GPU reserved for live serving only. | | 2026-03-18 | First LoRA training run (233 examples, 3 epochs) | Loss 1.5→0.10. Model is bad — hallucinating Chinese, leaking system prompt. Expected at this data scale. Deployed to dev server for live data collection. | | 2026-03-18 | Bot-driven data collection on dev server | 3 Mineflayer prayer bots with Gemini (diverse prompts) + Dolphin-Mistral (offensive prompts, first 100 then 5%). PrayBot_0 runs survival mode with auto-respawn and contextual low-health prayers. | | 2026-03-18 | Dev server AI God service (mc-aigod-dev) | Separate systemd service using MC_AIGOD_CONFIG env var. Runs fine-tuned model on steel141, 100 interventions/day, all players sudo, training audit to separate log. | | 2026-03-18 | Minecraft knowledge corpus baked into training | 1505 items, 886 recipes, 1166 blocks from minecraft-data 1.21.11. Recipe dependency trees, smelting knowledge, crafting chain examples. 107 command ref + 176 recipe examples. | | 2026-03-18 | Claude distillation: God Soul + Haiku | God Soul document adapted from Claude's soul framework. Haiku distills 344 training examples ($0.65). Dev server switched to Haiku API ($5 budget) for high-quality live data. | | 2026-03-18 | Version-aware training | Model trained to know it targets 1.21.x, understands 1.20.5 syntax changes, knows recipes evolve with updates. | --- ## 5. Dev Server (Training Sandbox) | Property | Value | |----------|-------| | Location | CT 644 on node-112 (same as live servers) | | Game port | `25568` | | RCON port | `25578` | | RCON password | `REDACTED_RCON` | | Data dir | `/opt/paper-dev-25568/` | | Version | Paper 1.21.11 | | Auth | `online-mode=false` (bots join without accounts) | | World type | Superflat, peaceful, creative, no structures | | Max players | 50 | | Service | `mc-paper-dev.service` (systemd, not MCSManager) | | Memory | 512M-1536M heap | | Bot framework | `/opt/mc-ai-bots/` (Mineflayer, Node.js v20) | **Management:** ```bash # On CT 644: systemctl start mc-paper-dev # Start dev server systemctl stop mc-paper-dev # Stop dev server systemctl status mc-paper-dev # Check status # Spawn test bots: cd /opt/mc-ai-bots PATH=/opt/mcsmanager/node-v20.12.2-linux-x64/bin:$PATH node spawn_bots.js 10 # Spawn 10 bots ``` **World reset:** Stop server, delete `/opt/paper-dev-25568/devworld/`, restart. --- ## 6. Open Questions - **Model size trade-off:** qwen3-coder comes in multiple sizes. Which fits in homelab VRAM while being smart enough? Need to benchmark. - **Mineflayer on vanilla vs Paper:** Mineflayer connects as a player (protocol-level). Works with vanilla servers but needs `online-mode=false` or an account. Implications for server slots and authentication. - **In-game bot safety:** The bot can execute actions via Mineflayer (place blocks, attack). Need strict guardrails separate from the RCON guardrails. - **Eval subjectivity:** Some tasks (troubleshooting, explanations) don't have single correct answers. Need to define scoring rubrics or use LLM-as-judge. - **Data licensing:** MineDojo's wiki/reddit corpus is CC-licensed and could supplement our knowledge base. Worth investigating. --- ## 7. Success Criteria | Metric | Actual Baseline (gemma3n) | Actual Baseline (qwen3:8b) | Fine-Tuned Target | |--------|:-------------------------:|:--------------------------:|:-----------------:| | **Sudo (strict scoring)** | | | | | Command match (loose) | 59.2% | 73.7% | 85%+ | | Exact match (strict) | 10.5% | 18.4% | 40%+ | | RCON success (live) | 33.1% | 34.6% | 70%+ | | Safety compliance | 93.4% | 92.1% | 99%+ | | **Pray (soft scoring)** | | | | | Command category match | — | — | 80%+ | | Has in-character message | — | — | 95%+ | | Appropriate intensity | — | — | 90%+ | | **All modes** | | | | | Syntax correctness | 82.9% | 82.9% | 95%+ | | Hallucination rate | 0% | 0% | 0% | | Empty response rate | 9.2% | 14.5% | <3% | | Response latency (avg) | 6.4s | 13.5s | <5s | --- *This document is updated as the project evolves. Check git history for previous versions.*