Mortdecai/PLAN.md

# PLAN.md -- Project Roadmap (Live Document)

> **Last updated:** 2026-03-18 (rev 2)
> **Status legend:** `[ ]` planned | `[~]` in progress | `[x]` done | `[-]` cancelled/deferred

---

## 0. Vision

Build a lightweight, Minecraft-focused AI assistant by adapting `qwen3-coder` (LoRA/SFT). The assistant operates as an **ops copilot** for Sethpc Minecraft servers -- generating correct commands, troubleshooting logs, automating admin tasks, and optionally acting as an **in-game AI character** for live interaction, training data collection, and evaluation.

This is **not** a gameplay agent (like Voyager/MineDojo). It is a **server operations assistant** with an optional embodied presence for testing and data gathering.

---

## 1. Prior Art & Inspirations

These projects informed the plan but solve different problems:

| Project | What it does | What we borrow |
|---------|-------------|----------------|
| **Voyager** (6.7k stars) | LLM-powered embodied agent that plays Minecraft via Mineflayer. Skill library + auto-curriculum + iterative prompting. | Skill library concept (reusable verified command sequences). Iterative self-verification loop for command correctness. |
| **MineDojo** (2.2k stars) | RL/LLM research framework with 3142 tasks. Internet-scale knowledge base (730K YouTube vids, 7K wiki pages, 340K Reddit posts). | Knowledge corpus pipeline -- scraping wiki.vg and Minecraft Wiki for command syntax reference data. Task-based evaluation structure. |
| **Mindcraft** (4.9k stars) | LLM + Mineflayer in-game bots with profiles, multi-agent collab. Supports Ollama, many APIs. | Profile-based bot architecture. In-game chat integration pattern. Ollama local model support. Provides own fine-tuned models (`sweaterdog/andy-4`). |
| **minecraft-mcp-server** (514 stars) | MCP (Model Context Protocol) server wrapping Mineflayer. Lets Claude/LLMs control a Minecraft character via tool calls. | MCP tool-call interface for in-game actions. Could be adapted for our eval harness. |
| **Mineflayer** (6.7k stars) | Node.js Minecraft bot framework. Supports 1.8-1.21.11. Movement, inventory, chat, block interaction. | Primary framework for in-game AI character. Mature, well-maintained, 1.21 support confirmed. |
| **Existing AI God system** (our own) | Log-tail + RCON + Ollama pipeline. `pray` trigger, divine intervention, command validation, syntax repair. Vanilla + Paper fork. | Direct predecessor. Baseline to measure against. Source of real training data (prayer logs, bug reports). |

---

## 2. Architecture Overview

```
                    +---------------------+
                    |   Minecraft Server   |
                    |  (CT 644, 1.21.x)   |
                    +----+----------+-----+
                         |          |
                    RCON |          | Protocol (Mineflayer)
                         |          |
               +---------+--+  +---+------------+
               | Ops Layer   |  | In-Game Agent  |
               | (existing   |  | (Mineflayer    |
               |  log-tail + |  |  bot, optional)|
               |  RCON cmds) |  +---+------------+
               +---------+--+      |
                         |         |
                    +----+---------+----+
                    |  Assistant Core   |
                    |  (qwen3-coder     |
                    |   + LoRA adapter) |
                    +----+----+---------+
                         |    |
                +--------+    +--------+
                |                      |
          +-----+------+    +---------+--------+
          | Tool Layer  |    | Knowledge/RAG    |
          | - RCON exec |    | - MC Wiki index  |
          | - Log query |    | - Command syntax |
          | - MCSManager|    | - Server context |
          |   API       |    | - Prior sessions |
          +-------------+    +------------------+
```

---

## 3. Phased Roadmap

### Phase 1: Foundation (Weeks 1-3) -- HIGH DETAIL

> Goal: Repo setup, baseline tooling, dataset schema, knowledge corpus.

#### 1.1 Project Setup
- [x] Define project idea and constraints (`IDEA.md`)
- [x] Confirm no prior art exists for this specific niche
- [x] Create `PLAN.md` (this document)
- [x] Create Gitea repo and configure remote
- [x] Set up directory structure:
  ```
  Mincecraft-AI-model/
  ├── PLAN.md
  ├── IDEA.md
  ├── SESSION.md             # local only (gitignored)
  ├── SESSION.default.md     # template reference (tracked)
  ├── .gitignore
  ├── data/
  │   ├── raw/               # scraped wiki, logs, transcripts
  │   ├── processed/         # cleaned, formatted training pairs
  │   │   └── seed_dataset.jsonl  # 31 seed examples
  │   ├── schema.json        # dataset JSON Schema
  │   └── validate_dataset.py
  ├── knowledge/
  │   ├── mc-commands/       # 1.21 command syntax reference
  │   ├── server-context/    # server.properties, datapacks, infra
  │   └── wiki-chunks/       # chunked wiki content for RAG
  ├── eval/
  │   ├── tasks/             # evaluation task definitions
  │   └── results/           # scored outputs (gitignored)
  ├── training/
  │   ├── configs/           # LoRA/SFT training configs
  │   ├── scripts/           # training launch scripts
  │   └── checkpoints/       # saved adapters (gitignored)
  ├── agent/
  │   ├── tools/             # RCON, log query, MCSManager tools
  │   ├── guardrails/        # command allowlist, safety policies
  │   └── prompts/           # system prompts, few-shot templates
  └── ingame/                # in-game bots (Mineflayer)
      ├── package.json
      ├── test_connect.js    # single bot connection test
      ├── spawn_bots.js      # multi-bot spawner (passive)
      └── aware_bots.js      # event-aware bots (training data)
  ```
- [x] Add `.gitignore` (checkpoints, secrets, __pycache__, node_modules)
- [x] Initial commit and push

#### 1.2 Dataset Schema
- [x] Define the training example format (`data/schema.json`) -- includes negative_output for wrong->correct pairs
- [x] Write a JSON Schema validator script (`data/validate_dataset.py`)
- [x] Seed 31 examples from repair code, prayer logs, sudo logs, and session history (`data/processed/seed_dataset.jsonl`)

#### 1.3 Knowledge Corpus
- [x] Scrape Minecraft Wiki command reference pages for 1.21.x syntax (14 commands in `knowledge/mc-commands/commands.json`)
  - Includes JE syntax, arguments, examples, version notes, and common errors per command
  - Commands validated live on dev server (Paper 1.21.11) -- 12/13 passed, 1 false negative (already in target state)
- [x] Extract and chunk local server context (`knowledge/server-context/servers.json`)
  - All 4 servers (mc1, shrink-world, paper-ai, paper-dev) with ports, RCON, settings, plugins
  - Player list with UUIDs, infrastructure details, version-specific notes
- [x] Index knowledge corpus for RAG retrieval (`knowledge/build_index.py` -- TF-IDF with title boosting)
  - 19 documents indexed, 725 unique terms
- [x] Validated with 6 test queries -- all return relevant top results

#### 1.4 Baseline Assistant (No Fine-Tuning)
- [ ] Build prompt-only assistant using `qwen3-coder` (via Ollama at 192.168.0.179)
- [ ] Implement tool-calling interface:
  - `rcon_execute(command)` -- send RCON command, return result
  - `query_log(pattern, lines)` -- search recent server log
  - `query_knowledge(question)` -- RAG lookup against knowledge corpus
  - `get_server_status()` -- player list, TPS, uptime via MCSManager API
- [ ] Implement safety guardrails:
  - Command allowlist (whitelist known-safe command prefixes)
  - Destructive action confirmation (commands matching `/kill`, `/stop`, `/ban`, `/op`, `/fill`, `/worldborder set 0`)
  - Syntax validation (1.21 enchantment format, weather values, effect names)
  - Audit log (every command attempted + result, timestamped JSON)
- [ ] Test baseline on 20 seed examples, record accuracy manually
- [ ] Document baseline performance as the bar to beat

---

### Phase 2: Data Collection & Evaluation Framework (Weeks 3-5) -- MEDIUM DETAIL

> Goal: Build a proper eval suite and expand the dataset using real server interactions.

#### 2.1 Evaluation Suite
- [ ] Define task categories:
  - **Command generation** -- "Give player X netherite sword with sharpness 5" -> correct `/give` command
  - **Troubleshooting** -- "Server is lagging" + log excerpt -> diagnosis + recommended actions
  - **Automation** -- "Shrink border by 10 every time someone dies" -> datapack/script plan
  - **Information** -- "What enchantments work on tridents in 1.21?" -> accurate answer
  - **Safety** -- "Delete the world" -> refusal or confirmation gate
- [ ] Write 50+ evaluation tasks across categories (target: 100 eventually)
- [ ] Build evaluation harness (`eval/harness.py`):
  - Loads task definitions
  - Runs each through the assistant
  - Scores: command syntax correctness (parseable?), factual accuracy, safety compliance, hallucination detection
  - Outputs scored results as JSON + summary report
- [ ] Run baseline evaluation, establish benchmark scores

#### 2.2 Data Expansion
- [ ] Extract training pairs from existing AI God prayer logs on CT 644
  - Parse `/var/log/mc_aigod_*.log` and prayer history
  - Convert to dataset schema format
  - Label quality: validated/unvalidated, correct/incorrect
- [ ] Extract pairs from bug_log reports (negative examples -- what went wrong)
- [ ] Generate synthetic examples:
  - Use a strong model (Claude/GPT-4) to generate diverse MC ops questions
  - Filter through command validator for correctness
  - Human review a sample for quality
- [ ] Target: 500+ training examples by end of Phase 2

#### 2.3 Data Pipeline
- [ ] Build ingestion script: raw logs/transcripts -> parsed -> schema-validated -> `data/processed/`
- [ ] Build deduplication and quality filters
- [ ] Version the dataset (git-tracked or DVC)

---

### Phase 3: Fine-Tuning (Weeks 5-8) -- MEDIUM DETAIL

> Goal: LoRA/SFT adaptation of qwen3-coder on the collected dataset.

#### 3.1 Training Infrastructure
- [ ] Decide hardware target:
  - Option A: steel141 (gaming PC, local GPU) -- best for iteration speed
  - Option B: Ollama server (192.168.0.179, CT 105) -- if GPU is available there
  - Option C: cloud burst (RunPod/Lambda) for larger runs
- [ ] Set up training environment (PyTorch, transformers, peft/LoRA, datasets)
- [ ] Write training config (LoRA rank, learning rate, epochs, batch size)
- [ ] Write training launch script with logging (Weights & Biases or simple file-based)

#### 3.2 First Training Run
- [ ] Format dataset for SFT (instruction/input/output or chat template)
- [ ] Train LoRA adapter on qwen3-coder base
- [ ] Run eval suite on fine-tuned model
- [ ] Compare against baseline: does fine-tuning help or hurt?
- [ ] Iterate: adjust data mix, hyperparameters, prompt format

#### 3.3 Iterative Improvement
- [ ] Identify weak categories from eval results
- [ ] Targeted data collection for weak areas
- [ ] Retrain and re-evaluate (repeat cycle)
- [ ] Track all runs with configs + scores for reproducibility

---

### Phase 4: In-Game AI Character (Weeks 6-10) -- MEDIUM DETAIL

> Goal: Deploy an LLM-controlled bot inside the Minecraft server for live interaction, data collection, and evaluation.

This phase can overlap with Phase 3. The in-game character serves three purposes:
1. **Live evaluation** -- test the model's command generation in real game context
2. **Training data collection** -- log all interactions as labeled examples
3. **User-facing feature** -- players can interact with an AI character in-game

#### 4.1 Bot Framework
- [ ] Set up Mineflayer bot in `ingame/` directory
  - Connect to mc1 server (192.168.0.244:25565) in offline auth mode
  - Bot name: configurable (e.g. "Oracle", "Scribe", or themed to AI God persona)
- [ ] Implement chat listener: player says something -> parsed as request
- [ ] Implement LLM bridge: request -> qwen3-coder (Ollama) -> structured response
- [ ] Implement action executor: structured response -> RCON commands and/or Mineflayer actions

#### 4.2 In-Game Capabilities
- [ ] **Chat interaction** -- respond to player questions about the server, commands, game mechanics
- [ ] **Command demonstration** -- execute commands and show results in-game
- [ ] **World observation** -- read nearby blocks, entities, player positions (via Mineflayer API)
- [ ] **Eval-in-the-loop** -- after executing a command, observe the result and self-verify:
  - "Did the block actually get placed?"
  - "Is the player's inventory correct?"
  - "Did the effect apply?"
  - Log success/failure as labeled training data

#### 4.3 Training Data Pipeline (In-Game)
- [ ] Every interaction logged as a candidate training example:
  ```json
  {
    "source": "ingame_live",
    "input": { "user_message": "...", "world_state": {...} },
    "output": { "commands": [...], "result": "success|failure|partial" },
    "verified": true  // because we observed the outcome
  }
  ```
- [ ] Successful interactions -> positive training examples
- [ ] Failed interactions -> negative examples or correction candidates
- [ ] Periodic batch export to `data/processed/` for retraining

#### 4.4 Inspiration from Existing Systems
- Mindcraft-style profiles for bot personality and behavior tuning
- Voyager-style skill library: successful command sequences saved and reusable
- MCP server pattern for clean tool-call interface between LLM and game actions
- Our own AI God `pray` system as the interaction model (but the bot IS the character, not just an RCON relay)

---

### Phase 5: Deployment & Serving (Weeks 8-12) -- LOW DETAIL

> Goal: Production-ready serving on homelab infrastructure.

- [ ] Choose serving stack:
  - Ollama with custom model (simplest, already in use)
  - vLLM for better throughput if needed
  - llama.cpp / llamafile for minimal footprint
- [ ] Package fine-tuned adapter + base model as a single deployable artifact
- [ ] Deploy to target node (Ollama at 192.168.0.179 or steel141)
- [ ] Wire up to existing AI God services (replace/augment current Ollama calls)
- [ ] Implement model switching: A/B test fine-tuned vs. base model
- [ ] Set up health checks, restart policies, log rotation
- [ ] Caddy reverse proxy if exposing API endpoint

---

### Phase 6: Observability & Iteration (Ongoing) -- LOW DETAIL

> Goal: Continuous improvement loop with monitoring and feedback.

- [ ] Dashboard for model performance (Grafana at monitor.sethpc.xyz)
  - Command accuracy rate over time
  - Hallucination rate
  - Safety trigger frequency
  - Latency percentiles
- [ ] Player feedback loop (in-game rating or bug_log integration)
- [ ] Automated retraining pipeline:
  - New validated examples accumulate
  - Periodic retrain trigger (manual or scheduled)
  - Eval gate: new model must beat current on eval suite to deploy
- [ ] Expand to multi-server support (mc1, shrink-world, Paper fork)
- [ ] Explore distillation from stronger models (Claude -> qwen3-coder dataset augmentation)

---

### Phase 7: Advanced Features (Future) -- SKETCH ONLY

These are ideas to explore after the core system is working. Prioritize based on what's actually useful.

- [ ] Multi-turn conversation memory (SQLite or Redis-backed sessions)
- [ ] Proactive monitoring: model watches logs continuously, alerts on anomalies
- [ ] Natural language -> datapack generation (write mcfunction files from descriptions)
- [ ] Cross-server orchestration (manage multiple servers from one assistant)
- [ ] Voice interface (TTS/STT for in-game narration, Discord integration)
- [ ] Public model release on HuggingFace if quality is good enough
- [ ] Web dashboard for non-technical server admins
- [ ] Integration with n8n for workflow automation triggers

---

## 4. Key Decisions Log

| Date | Decision | Rationale |
|------|----------|-----------|
| 2026-03-18 | Base model: `qwen3-coder` | Good code/instruction following, runs on homelab hardware via Ollama, LoRA-friendly |
| 2026-03-18 | Adaptation approach: LoRA/SFT, not full pretrain | Cost-effective, iterative, preserves base capabilities |
| 2026-03-18 | Build baseline first, tune later | Need measurement before optimization. Prompt+tools may already be "good enough" for many tasks |
| 2026-03-18 | In-game character via Mineflayer | Enables live eval, auto-verified training data, and a player-facing feature. Mineflayer supports 1.21.x |
| 2026-03-18 | Dataset from real ops, not just synthetic | AI God prayer logs + bug reports are high-signal domain-specific data |
| 2026-03-18 | RCON-based world observation tools (not Mineflayer MCP) for live server | Live Paper server has online-mode=true; RCON data commands avoid auth complexity while providing position/entity/block observation |
| 2026-03-18 | Dual tool-set architecture: RCON tools + Mineflayer tools | RCON for admin ops (server-side), Mineflayer for in-game presence (client-side). Same model, different tool sets per deployment |
| 2026-03-18 | Offline dev Paper server for training bots | Dedicated offline-mode Paper 1.21.11 on port 25568. Allows unlimited Mineflayer bots without auth, world resets, destructive testing |
| 2026-03-18 | Extract training data from existing repair code | Every hardcoded syntax fixer in mc_aigod_paper.py encodes a wrong->correct pair. 31 seed examples extracted from 10 repair functions, prayer logs, and session history |

---

## 5. Dev Server (Training Sandbox)

| Property | Value |
|----------|-------|
| Location | CT 644 on node-112 (same as live servers) |
| Game port | `25568` |
| RCON port | `25578` |
| RCON password | `REDACTED_RCON` |
| Data dir | `/opt/paper-dev-25568/` |
| Version | Paper 1.21.11 |
| Auth | `online-mode=false` (bots join without accounts) |
| World type | Superflat, peaceful, creative, no structures |
| Max players | 50 |
| Service | `mc-paper-dev.service` (systemd, not MCSManager) |
| Memory | 512M-1536M heap |
| Bot framework | `/opt/mc-ai-bots/` (Mineflayer, Node.js v20) |

**Management:**
```bash
# On CT 644:
systemctl start mc-paper-dev    # Start dev server
systemctl stop mc-paper-dev     # Stop dev server
systemctl status mc-paper-dev   # Check status

# Spawn test bots:
cd /opt/mc-ai-bots
PATH=/opt/mcsmanager/node-v20.12.2-linux-x64/bin:$PATH
node spawn_bots.js 10           # Spawn 10 bots
```

**World reset:** Stop server, delete `/opt/paper-dev-25568/devworld/`, restart.

---

## 6. Open Questions


- **Model size trade-off:** qwen3-coder comes in multiple sizes. Which fits in homelab VRAM while being smart enough? Need to benchmark.
- **Mineflayer on vanilla vs Paper:** Mineflayer connects as a player (protocol-level). Works with vanilla servers but needs `online-mode=false` or an account. Implications for server slots and authentication.
- **In-game bot safety:** The bot can execute actions via Mineflayer (place blocks, attack). Need strict guardrails separate from the RCON guardrails.
- **Eval subjectivity:** Some tasks (troubleshooting, explanations) don't have single correct answers. Need to define scoring rubrics or use LLM-as-judge.
- **Data licensing:** MineDojo's wiki/reddit corpus is CC-licensed and could supplement our knowledge base. Worth investigating.

---

## 7. Success Criteria

| Metric | Baseline Target | Fine-Tuned Target |
|--------|----------------|-------------------|
| Command syntax correctness | 70% | 90%+ |
| 1.21 format accuracy (enchantments, effects) | 50% | 95%+ |
| Safety compliance (blocks destructive commands) | 90% | 99%+ |
| Hallucination rate (invents nonexistent commands) | 30% | <5% |
| Response latency (p95) | <5s | <3s |
| In-game eval pass rate | n/a | 80%+ |

---

*This document is updated as the project evolves. Check git history for previous versions.*