Files
Mortdecai/PLAN.md
T
Seth 78031d16c0 Risk gradient (0-5), updated system prompts, 233 examples
Risk gradient system:
- All 233 training examples tagged with risk_level (0-5)
- 0=blocked(15), 1=refuse(9), 2=warn(17), 3=normal(169), 4=generous(23)
- Schema updated with risk_level and scoring_mode fields
- Eval harness uses risk_level for safety scoring

System prompts rewritten:
- Shared syntax rules and risk gradient reference across all modes
- Sudo: permission level 4, do what admin asks, only refuse level 0-1
- God: permission level 2-4 (mood-dependent), character-driven decisions
- God_system: permission level 3, 80% benevolent / 15% mischievous / 5% wrathful

Data:
- 20 new live playtest examples from training audit log (233 total)
- 43 wrong→right pairs (17 from validator repairs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 16:14:54 -04:00

24 KiB

PLAN.md -- Project Roadmap (Live Document)

Last updated: 2026-03-18 (rev 2) Status legend: [ ] planned | [~] in progress | [x] done | [-] cancelled/deferred


0. Vision

Build a lightweight, Minecraft-focused AI assistant by adapting qwen3-coder (LoRA/SFT). The assistant operates as an ops copilot for Sethpc Minecraft servers -- generating correct commands, troubleshooting logs, automating admin tasks, and optionally acting as an in-game AI character for live interaction, training data collection, and evaluation.

This is not a gameplay agent (like Voyager/MineDojo). It is a server operations assistant with an optional embodied presence for testing and data gathering.


1. Prior Art & Inspirations

These projects informed the plan but solve different problems:

Project What it does What we borrow
Voyager (6.7k stars) LLM-powered embodied agent that plays Minecraft via Mineflayer. Skill library + auto-curriculum + iterative prompting. Skill library concept (reusable verified command sequences). Iterative self-verification loop for command correctness.
MineDojo (2.2k stars) RL/LLM research framework with 3142 tasks. Internet-scale knowledge base (730K YouTube vids, 7K wiki pages, 340K Reddit posts). Knowledge corpus pipeline -- scraping wiki.vg and Minecraft Wiki for command syntax reference data. Task-based evaluation structure.
Mindcraft (4.9k stars) LLM + Mineflayer in-game bots with profiles, multi-agent collab. Supports Ollama, many APIs. Profile-based bot architecture. In-game chat integration pattern. Ollama local model support. Provides own fine-tuned models (sweaterdog/andy-4).
minecraft-mcp-server (514 stars) MCP (Model Context Protocol) server wrapping Mineflayer. Lets Claude/LLMs control a Minecraft character via tool calls. MCP tool-call interface for in-game actions. Could be adapted for our eval harness.
Mineflayer (6.7k stars) Node.js Minecraft bot framework. Supports 1.8-1.21.11. Movement, inventory, chat, block interaction. Primary framework for in-game AI character. Mature, well-maintained, 1.21 support confirmed.
Existing AI God system (our own) Log-tail + RCON + Ollama pipeline. pray trigger, divine intervention, command validation, syntax repair. Vanilla + Paper fork. Direct predecessor. Baseline to measure against. Source of real training data (prayer logs, bug reports).

2. Architecture Overview

                    +---------------------+
                    |   Minecraft Server   |
                    |  (CT 644, 1.21.x)   |
                    +----+----------+-----+
                         |          |
                    RCON |          | Protocol (Mineflayer)
                         |          |
               +---------+--+  +---+------------+
               | Ops Layer   |  | In-Game Agent  |
               | (existing   |  | (Mineflayer    |
               |  log-tail + |  |  bot, optional)|
               |  RCON cmds) |  +---+------------+
               +---------+--+      |
                         |         |
                    +----+---------+----+
                    |  Assistant Core   |
                    |  (qwen3-coder     |
                    |   + LoRA adapter) |
                    +----+----+---------+
                         |    |
                +--------+    +--------+
                |                      |
          +-----+------+    +---------+--------+
          | Tool Layer  |    | Knowledge/RAG    |
          | - RCON exec |    | - MC Wiki index  |
          | - Log query |    | - Command syntax |
          | - MCSManager|    | - Server context |
          |   API       |    | - Prior sessions |
          +-------------+    +------------------+

3. Phased Roadmap

Phase 1: Foundation (Weeks 1-3) -- HIGH DETAIL

Goal: Repo setup, baseline tooling, dataset schema, knowledge corpus.

1.1 Project Setup

  • Define project idea and constraints (IDEA.md)
  • Confirm no prior art exists for this specific niche
  • Create PLAN.md (this document)
  • Create Gitea repo and configure remote
  • Set up directory structure:
    Mincecraft-AI-model/
    ├── PLAN.md
    ├── IDEA.md
    ├── SESSION.md             # local only (gitignored)
    ├── SESSION.default.md     # template reference (tracked)
    ├── .gitignore
    ├── data/
    │   ├── raw/               # scraped wiki, logs, transcripts
    │   ├── processed/         # cleaned, formatted training pairs
    │   │   └── seed_dataset.jsonl  # 31 seed examples
    │   ├── schema.json        # dataset JSON Schema
    │   └── validate_dataset.py
    ├── knowledge/
    │   ├── mc-commands/       # 1.21 command syntax reference
    │   ├── server-context/    # server.properties, datapacks, infra
    │   └── wiki-chunks/       # chunked wiki content for RAG
    ├── eval/
    │   ├── tasks/             # evaluation task definitions
    │   └── results/           # scored outputs (gitignored)
    ├── training/
    │   ├── configs/           # LoRA/SFT training configs
    │   ├── scripts/           # training launch scripts
    │   └── checkpoints/       # saved adapters (gitignored)
    ├── agent/
    │   ├── tools/             # RCON, log query, MCSManager tools
    │   ├── guardrails/        # command allowlist, safety policies
    │   └── prompts/           # system prompts, few-shot templates
    └── ingame/                # in-game bots (Mineflayer)
        ├── package.json
        ├── test_connect.js    # single bot connection test
        ├── spawn_bots.js      # multi-bot spawner (passive)
        └── aware_bots.js      # event-aware bots (training data)
    
  • Add .gitignore (checkpoints, secrets, pycache, node_modules)
  • Initial commit and push

1.2 Dataset Schema

  • Define the training example format (data/schema.json) -- includes negative_output for wrong->correct pairs
  • Write a JSON Schema validator script (data/validate_dataset.py)
  • Seed 31 examples from repair code, prayer logs, sudo logs, and session history (data/processed/seed_dataset.jsonl)

1.3 Knowledge Corpus

  • Scrape Minecraft Wiki command reference pages for 1.21.x syntax (14 commands in knowledge/mc-commands/commands.json)
    • Includes JE syntax, arguments, examples, version notes, and common errors per command
    • Commands validated live on dev server (Paper 1.21.11) -- 12/13 passed, 1 false negative (already in target state)
  • Extract and chunk local server context (knowledge/server-context/servers.json)
    • All 4 servers (mc1, shrink-world, paper-ai, paper-dev) with ports, RCON, settings, plugins
    • Player list with UUIDs, infrastructure details, version-specific notes
  • Index knowledge corpus for RAG retrieval (knowledge/build_index.py -- TF-IDF with title boosting)
    • 19 documents indexed, 725 unique terms
  • Validated with 6 test queries -- all return relevant top results

1.4 Baseline Assistant (No Fine-Tuning)

  • Build prompt-only assistant (agent/serve.py) with Ollama integration
    • Interactive CLI, single-query, and dataset evaluation modes
    • Configurable model, RCON, Ollama URL via JSON config or CLI args
  • Implement tool-calling interface:
    • agent/tools/rcon_tool.py -- RCON execute, get_server_status, get_player_info
    • agent/tools/knowledge_tool.py -- RAG search, command reference lookup, server context
  • Implement safety guardrails (agent/guardrails/command_filter.py):
    • Command allowlist (14 safe prefixes, blocks /stop /op /ban etc.)
    • Execute-tail bypass detection (blocks unsafe commands inside execute chains)
    • Destructive action detection (kill @a, fill air, worldborder 0, TNT, fire)
    • 1.21 syntax validation warnings (old NBT, bare effect, weather storm, gamemode abbrevs)
    • Audit log (every query + commands + results to data/raw/audit_log.jsonl)
    • All guardrails validated: 10/10 allowlist, 5/6 syntax warnings
  • System prompts for sudo, god, and intervention modes (agent/prompts/system_prompts.py)
  • Run baseline evaluation on seed dataset, record accuracy
  • Document baseline performance as the bar to beat

Phase 2: Data Collection & Evaluation Framework (Weeks 3-5) -- MEDIUM DETAIL

Goal: Build a proper eval suite and expand the dataset using real server interactions.

2.1 Evaluation Suite

  • Define task categories:
    • Command generation (50 examples) -- "Give player X netherite sword with sharpness 5" -> correct /give command
    • Troubleshooting (6 examples) -- "Server is lagging" -> diagnosis + recommended actions
    • Information (6 examples) -- "What enchantments work on tridents in 1.21?" -> accurate answer
    • Safety (10 examples) -- "Delete the world" -> refusal, social engineering, indirect destruction, privilege escalation
    • Negative (4 examples) -- Known failure modes (JSON escaping, hallucination)
    • Automation -- deferred (need datapack examples)
  • Write 182 evaluation tasks across categories (target was 100; exceeded)
    • Phase 1 seed: 31 examples (repair patterns, prayer logs, session history)
    • Phase 2 manual: 45 examples (troubleshooting, edge cases, ambiguity, safety, info)
    • Phase 2 log extraction: 106 examples (58 sudo, 34 prayer, 14 bug reports from CT 644 logs)
  • Build evaluation harness (eval/harness.py):
    • Per-category breakdowns, baseline comparison with deltas
    • Hallucination detection, empty response tracking, gratuitous action detection
    • Failure detail reporting for targeted improvement
    • --save-baseline / --baseline for tracking improvement over time
  • Build live bake-off harness (eval/live_bakeoff.py):
    • Executes commands via RCON on real server, measures rcon_success rate
    • Side-by-side model comparison with RCON disagreement analysis
  • Run baseline evaluation, establish benchmark scores:
    • gemma3n:e4b baseline: 59.2% cmd match, 82.9% syntax, 93.4% safety
    • qwen3:8b comparison: 73.7% cmd match, 82.9% syntax, 92.1% safety
    • Key gaps: troubleshooting (16-33%), info queries (0-67%), safety (40-50%)

2.2 Data Expansion

  • Extract training pairs from existing AI God prayer logs on CT 644
    • Parsed paper + shrink service logs, prayer memories, bug logs
    • 106 examples extracted (58 sudo, 34 prayer, 14 bug reports)
    • All tagged validated=false, needs human review
  • Extract pairs from bug_log reports (negative examples -- what went wrong)
    • 14 negative examples from bug reports showing model failures
    • Common failures: invalid item IDs, old NBT syntax, fall damage from TP, suffocation
  • Generate synthetic examples:
    • Use a strong model (Claude/GPT-4) to generate diverse MC ops questions
    • Filter through command validator for correctness
    • Human review a sample for quality
  • Target: 500+ training examples by end of Phase 2 (currently 182)

2.3 Data Pipeline

  • Structured training audit log added to mc_aigod_paper.py
    • Every pray/sudo interaction writes JSONL to /var/log/mc_training_audit.jsonl
    • Captures: player, mode, commands_generated, commands_executed, rcon_results, server context
    • Auto-infers category (command_gen, info, safety, troubleshoot)
    • All entries tagged needs_review=true
  • Enhanced bug_log → training feedback pipeline
    • bug_log entries now write structured feedback to training audit
    • Links to player's last sudo/prayer interaction
    • Trust level tagging: admin="verified", playtesters="unverified"
    • Non-admin feedback gets reviewer_notes warning about possible wrong expectations
  • Playtest infrastructure
    • All servers switched to online-mode=false + whitelist (slingshooter08 whitelisted)
    • sudo_allow_all_players config flag added (enabled for paper-ai)
    • Reddit post draft + Google Form application created
    • Training servers: paper-ai (primary, human playtesters) + paper-dev (bots, destructive testing)
  • Build ingestion script: raw logs/transcripts -> parsed -> schema-validated -> data/processed/
  • Build deduplication and quality filters
  • Version the dataset (git-tracked or DVC)

Phase 3: Fine-Tuning (Weeks 5-8) -- MEDIUM DETAIL

Goal: LoRA/SFT adaptation of qwen3-coder on the collected dataset.

3.1 Training Infrastructure

  • Decide hardware target:
    • Option A: steel141 (gaming PC, local GPU) -- best for iteration speed
    • Option B: Ollama server (192.168.0.179, CT 105) -- if GPU is available there
    • Option C: cloud burst (RunPod/Lambda) for larger runs
  • Set up training environment (PyTorch, transformers, peft/LoRA, datasets)
  • Write training config (LoRA rank, learning rate, epochs, batch size)
  • Write training launch script with logging (Weights & Biases or simple file-based)

3.2 First Training Run

  • Format dataset for SFT (instruction/input/output or chat template)
  • Train LoRA adapter on qwen3-coder base
  • Run eval suite on fine-tuned model
  • Compare against baseline: does fine-tuning help or hurt?
  • Iterate: adjust data mix, hyperparameters, prompt format

3.3 Iterative Improvement

  • Identify weak categories from eval results
  • Targeted data collection for weak areas
  • Retrain and re-evaluate (repeat cycle)
  • Track all runs with configs + scores for reproducibility

Phase 4: In-Game AI Character (Weeks 6-10) -- MEDIUM DETAIL

Goal: Deploy an LLM-controlled bot inside the Minecraft server for live interaction, data collection, and evaluation.

This phase can overlap with Phase 3. The in-game character serves three purposes:

  1. Live evaluation -- test the model's command generation in real game context
  2. Training data collection -- log all interactions as labeled examples
  3. User-facing feature -- players can interact with an AI character in-game

4.1 Bot Framework

  • Set up Mineflayer bot in ingame/ directory
    • Connect to mc1 server (192.168.0.244:25565) in offline auth mode
    • Bot name: configurable (e.g. "Oracle", "Scribe", or themed to AI God persona)
  • Implement chat listener: player says something -> parsed as request
  • Implement LLM bridge: request -> qwen3-coder (Ollama) -> structured response
  • Implement action executor: structured response -> RCON commands and/or Mineflayer actions

4.2 In-Game Capabilities

  • Chat interaction -- respond to player questions about the server, commands, game mechanics
  • Command demonstration -- execute commands and show results in-game
  • World observation -- read nearby blocks, entities, player positions (via Mineflayer API)
  • Eval-in-the-loop -- after executing a command, observe the result and self-verify:
    • "Did the block actually get placed?"
    • "Is the player's inventory correct?"
    • "Did the effect apply?"
    • Log success/failure as labeled training data

4.3 Training Data Pipeline (In-Game)

  • Every interaction logged as a candidate training example:
    {
      "source": "ingame_live",
      "input": { "user_message": "...", "world_state": {...} },
      "output": { "commands": [...], "result": "success|failure|partial" },
      "verified": true  // because we observed the outcome
    }
    
  • Successful interactions -> positive training examples
  • Failed interactions -> negative examples or correction candidates
  • Periodic batch export to data/processed/ for retraining

4.4 Inspiration from Existing Systems

  • Mindcraft-style profiles for bot personality and behavior tuning
  • Voyager-style skill library: successful command sequences saved and reusable
  • MCP server pattern for clean tool-call interface between LLM and game actions
  • Our own AI God pray system as the interaction model (but the bot IS the character, not just an RCON relay)

Phase 5: Deployment & Serving (Weeks 8-12) -- LOW DETAIL

Goal: Production-ready serving on homelab infrastructure.

  • Choose serving stack:
    • Ollama with custom model (simplest, already in use)
    • vLLM for better throughput if needed
    • llama.cpp / llamafile for minimal footprint
  • Package fine-tuned adapter + base model as a single deployable artifact
  • Deploy to target node (Ollama at 192.168.0.179 or steel141)
  • Wire up to existing AI God services (replace/augment current Ollama calls)
  • Implement model switching: A/B test fine-tuned vs. base model
  • Set up health checks, restart policies, log rotation
  • Caddy reverse proxy if exposing API endpoint

Phase 6: Observability & Iteration (Ongoing) -- LOW DETAIL

Goal: Continuous improvement loop with monitoring and feedback.

  • Dashboard for model performance (Grafana at monitor.sethpc.xyz)
    • Command accuracy rate over time
    • Hallucination rate
    • Safety trigger frequency
    • Latency percentiles
  • Player feedback loop (in-game rating or bug_log integration)
  • Automated retraining pipeline:
    • New validated examples accumulate
    • Periodic retrain trigger (manual or scheduled)
    • Eval gate: new model must beat current on eval suite to deploy
  • Expand to multi-server support (mc1, shrink-world, Paper fork)
  • Explore distillation from stronger models (Claude -> qwen3-coder dataset augmentation)

Phase 7: Advanced Features (Future) -- SKETCH ONLY

These are ideas to explore after the core system is working. Prioritize based on what's actually useful.

  • Multi-turn conversation memory (SQLite or Redis-backed sessions)
  • Proactive monitoring: model watches logs continuously, alerts on anomalies
  • Natural language -> datapack generation (write mcfunction files from descriptions)
  • Cross-server orchestration (manage multiple servers from one assistant)
  • Voice interface (TTS/STT for in-game narration, Discord integration)
  • Public model release on HuggingFace if quality is good enough
  • Web dashboard for non-technical server admins
  • Integration with n8n for workflow automation triggers

4. Key Decisions Log

Date Decision Rationale
2026-03-18 Base model: qwen3-coder Good code/instruction followingSuperseded: see below
2026-03-18 Serving model: gemma3n:e4b (6.9B) Bake-off winner: 80.6% cmd match, 100% safety, 5.9s latency. Beats qwen3-coder:30b on all metrics. Deployed to RTX 4000 on node-197.
2026-03-18 Fine-tuning base: qwen3:8b (dense, Apache 2.0) 77.4% cmd match with token budget fix. Best syntax quality, perfect safety, strong Unsloth ecosystem. Token-budget issue = exactly what LoRA fixes.
2026-03-18 Training hardware: steel141 RTX 3090 Ti (24GB) QLoRA on 8B model fits easily. Conda env mc-train with Unsloth 2026.3.5 ready.
2026-03-18 Serving hardware: node-197 RTX 4000 (8GB) via Ollama 35/36 layers GPU offload for 7B models. Always-on, no desktop contention.
2026-03-18 Adaptation approach: LoRA/SFT, not full pretrain Cost-effective, iterative, preserves base capabilities
2026-03-18 Build baseline first, tune later Need measurement before optimization. Prompt+tools may already be "good enough" for many tasks
2026-03-18 In-game character via Mineflayer Enables live eval, auto-verified training data, and a player-facing feature. Mineflayer supports 1.21.x
2026-03-18 Dataset from real ops, not just synthetic AI God prayer logs + bug reports are high-signal domain-specific data
2026-03-18 RCON-based world observation tools (not Mineflayer MCP) for live server Live Paper server has online-mode=true; RCON data commands avoid auth complexity while providing position/entity/block observation
2026-03-18 Dual tool-set architecture: RCON tools + Mineflayer tools RCON for admin ops (server-side), Mineflayer for in-game presence (client-side). Same model, different tool sets per deployment
2026-03-18 Offline dev Paper server for training bots Dedicated offline-mode Paper 1.21.11 on port 25568. Allows unlimited Mineflayer bots without auth, world resets, destructive testing
2026-03-18 Extract training data from existing repair code Every hardcoded syntax fixer in mc_aigod_paper.py encodes a wrong->correct pair. 31 seed examples extracted from 10 repair functions, prayer logs, and session history
2026-03-18 Numerical risk gradient (0-5) instead of per-mode rule sets 0=blocked (server crash/privesc), 1=refuse (mass harm), 2=warn+allow (self-destructive), 3=normal, 4=generous (admin/creative), 5=unrestricted. Each mode sets a permission threshold: sudo=4, pray=2-4 (mood shifts), god_system=3. One system, not three separate constraint models.
2026-03-18 Mode-aware eval scoring Sudo scored strict (exact command match). Pray/god scored soft (command category match, in-character message, appropriate intensity). Exact match meaningless for pray — God's creative interpretation is a feature.
2026-03-18 God is a character, not a safety filter Pray mode: God decides based on worthiness/character/mood. The prayer is input to God's decision, not an instruction. God acts in mysterious ways — sometimes generous, sometimes strict, occasionally wrathful. Training data reflects this with loose expected outputs.
2026-03-18 Validator improvements: 5 new syntax repair functions @s→player, NBT→component enchants, strip invalid components, hallucinated effect/command repair. Deployed to paper-ai. Every repair is a negative→positive training pair.
2026-03-18 Eval/testing on steel141 (RTX 3090 Ti), not prod RTX 4000 All eval scripts default to 192.168.0.141:11434. Prod GPU reserved for live serving only.

5. Dev Server (Training Sandbox)

Property Value
Location CT 644 on node-112 (same as live servers)
Game port 25568
RCON port 25578
RCON password REDACTED_RCON
Data dir /opt/paper-dev-25568/
Version Paper 1.21.11
Auth online-mode=false (bots join without accounts)
World type Superflat, peaceful, creative, no structures
Max players 50
Service mc-paper-dev.service (systemd, not MCSManager)
Memory 512M-1536M heap
Bot framework /opt/mc-ai-bots/ (Mineflayer, Node.js v20)

Management:

# On CT 644:
systemctl start mc-paper-dev    # Start dev server
systemctl stop mc-paper-dev     # Stop dev server
systemctl status mc-paper-dev   # Check status

# Spawn test bots:
cd /opt/mc-ai-bots
PATH=/opt/mcsmanager/node-v20.12.2-linux-x64/bin:$PATH
node spawn_bots.js 10           # Spawn 10 bots

World reset: Stop server, delete /opt/paper-dev-25568/devworld/, restart.


6. Open Questions

  • Model size trade-off: qwen3-coder comes in multiple sizes. Which fits in homelab VRAM while being smart enough? Need to benchmark.
  • Mineflayer on vanilla vs Paper: Mineflayer connects as a player (protocol-level). Works with vanilla servers but needs online-mode=false or an account. Implications for server slots and authentication.
  • In-game bot safety: The bot can execute actions via Mineflayer (place blocks, attack). Need strict guardrails separate from the RCON guardrails.
  • Eval subjectivity: Some tasks (troubleshooting, explanations) don't have single correct answers. Need to define scoring rubrics or use LLM-as-judge.
  • Data licensing: MineDojo's wiki/reddit corpus is CC-licensed and could supplement our knowledge base. Worth investigating.

7. Success Criteria

Metric Actual Baseline (gemma3n) Actual Baseline (qwen3:8b) Fine-Tuned Target
Sudo (strict scoring)
Command match (loose) 59.2% 73.7% 85%+
Exact match (strict) 10.5% 18.4% 40%+
RCON success (live) 33.1% 34.6% 70%+
Safety compliance 93.4% 92.1% 99%+
Pray (soft scoring)
Command category match 80%+
Has in-character message 95%+
Appropriate intensity 90%+
All modes
Syntax correctness 82.9% 82.9% 95%+
Hallucination rate 0% 0% 0%
Empty response rate 9.2% 14.5% <3%
Response latency (avg) 6.4s 13.5s <5s

This document is updated as the project evolves. Check git history for previous versions.