- knowledge/mc-commands/commands.json: 14 MC commands with JE syntax, args, examples, common errors, 1.21 version notes - knowledge/server-context/servers.json: all 4 servers (mc1, shrink, paper-ai, paper-dev) with full config - knowledge/build_index.py: TF-IDF indexer + search function (19 docs, 725 terms) - All command syntax validated live on dev server via RCON (12/13 passed) - PLAN.md: mark Phase 1.3 complete
20 KiB
PLAN.md -- Project Roadmap (Live Document)
Last updated: 2026-03-18 (rev 2) Status legend:
[ ]planned |[~]in progress |[x]done |[-]cancelled/deferred
0. Vision
Build a lightweight, Minecraft-focused AI assistant by adapting qwen3-coder (LoRA/SFT). The assistant operates as an ops copilot for Sethpc Minecraft servers -- generating correct commands, troubleshooting logs, automating admin tasks, and optionally acting as an in-game AI character for live interaction, training data collection, and evaluation.
This is not a gameplay agent (like Voyager/MineDojo). It is a server operations assistant with an optional embodied presence for testing and data gathering.
1. Prior Art & Inspirations
These projects informed the plan but solve different problems:
| Project | What it does | What we borrow |
|---|---|---|
| Voyager (6.7k stars) | LLM-powered embodied agent that plays Minecraft via Mineflayer. Skill library + auto-curriculum + iterative prompting. | Skill library concept (reusable verified command sequences). Iterative self-verification loop for command correctness. |
| MineDojo (2.2k stars) | RL/LLM research framework with 3142 tasks. Internet-scale knowledge base (730K YouTube vids, 7K wiki pages, 340K Reddit posts). | Knowledge corpus pipeline -- scraping wiki.vg and Minecraft Wiki for command syntax reference data. Task-based evaluation structure. |
| Mindcraft (4.9k stars) | LLM + Mineflayer in-game bots with profiles, multi-agent collab. Supports Ollama, many APIs. | Profile-based bot architecture. In-game chat integration pattern. Ollama local model support. Provides own fine-tuned models (sweaterdog/andy-4). |
| minecraft-mcp-server (514 stars) | MCP (Model Context Protocol) server wrapping Mineflayer. Lets Claude/LLMs control a Minecraft character via tool calls. | MCP tool-call interface for in-game actions. Could be adapted for our eval harness. |
| Mineflayer (6.7k stars) | Node.js Minecraft bot framework. Supports 1.8-1.21.11. Movement, inventory, chat, block interaction. | Primary framework for in-game AI character. Mature, well-maintained, 1.21 support confirmed. |
| Existing AI God system (our own) | Log-tail + RCON + Ollama pipeline. pray trigger, divine intervention, command validation, syntax repair. Vanilla + Paper fork. |
Direct predecessor. Baseline to measure against. Source of real training data (prayer logs, bug reports). |
2. Architecture Overview
+---------------------+
| Minecraft Server |
| (CT 644, 1.21.x) |
+----+----------+-----+
| |
RCON | | Protocol (Mineflayer)
| |
+---------+--+ +---+------------+
| Ops Layer | | In-Game Agent |
| (existing | | (Mineflayer |
| log-tail + | | bot, optional)|
| RCON cmds) | +---+------------+
+---------+--+ |
| |
+----+---------+----+
| Assistant Core |
| (qwen3-coder |
| + LoRA adapter) |
+----+----+---------+
| |
+--------+ +--------+
| |
+-----+------+ +---------+--------+
| Tool Layer | | Knowledge/RAG |
| - RCON exec | | - MC Wiki index |
| - Log query | | - Command syntax |
| - MCSManager| | - Server context |
| API | | - Prior sessions |
+-------------+ +------------------+
3. Phased Roadmap
Phase 1: Foundation (Weeks 1-3) -- HIGH DETAIL
Goal: Repo setup, baseline tooling, dataset schema, knowledge corpus.
1.1 Project Setup
- Define project idea and constraints (
IDEA.md) - Confirm no prior art exists for this specific niche
- Create
PLAN.md(this document) - Create Gitea repo and configure remote
- Set up directory structure:
Mincecraft-AI-model/ ├── PLAN.md ├── IDEA.md ├── SESSION.md # local only (gitignored) ├── SESSION.default.md # template reference (tracked) ├── .gitignore ├── data/ │ ├── raw/ # scraped wiki, logs, transcripts │ ├── processed/ # cleaned, formatted training pairs │ │ └── seed_dataset.jsonl # 31 seed examples │ ├── schema.json # dataset JSON Schema │ └── validate_dataset.py ├── knowledge/ │ ├── mc-commands/ # 1.21 command syntax reference │ ├── server-context/ # server.properties, datapacks, infra │ └── wiki-chunks/ # chunked wiki content for RAG ├── eval/ │ ├── tasks/ # evaluation task definitions │ └── results/ # scored outputs (gitignored) ├── training/ │ ├── configs/ # LoRA/SFT training configs │ ├── scripts/ # training launch scripts │ └── checkpoints/ # saved adapters (gitignored) ├── agent/ │ ├── tools/ # RCON, log query, MCSManager tools │ ├── guardrails/ # command allowlist, safety policies │ └── prompts/ # system prompts, few-shot templates └── ingame/ # in-game bots (Mineflayer) ├── package.json ├── test_connect.js # single bot connection test ├── spawn_bots.js # multi-bot spawner (passive) └── aware_bots.js # event-aware bots (training data) - Add
.gitignore(checkpoints, secrets, pycache, node_modules) - Initial commit and push
1.2 Dataset Schema
- Define the training example format (
data/schema.json) -- includes negative_output for wrong->correct pairs - Write a JSON Schema validator script (
data/validate_dataset.py) - Seed 31 examples from repair code, prayer logs, sudo logs, and session history (
data/processed/seed_dataset.jsonl)
1.3 Knowledge Corpus
- Scrape Minecraft Wiki command reference pages for 1.21.x syntax (14 commands in
knowledge/mc-commands/commands.json)- Includes JE syntax, arguments, examples, version notes, and common errors per command
- Commands validated live on dev server (Paper 1.21.11) -- 12/13 passed, 1 false negative (already in target state)
- Extract and chunk local server context (
knowledge/server-context/servers.json)- All 4 servers (mc1, shrink-world, paper-ai, paper-dev) with ports, RCON, settings, plugins
- Player list with UUIDs, infrastructure details, version-specific notes
- Index knowledge corpus for RAG retrieval (
knowledge/build_index.py-- TF-IDF with title boosting)- 19 documents indexed, 725 unique terms
- Validated with 6 test queries -- all return relevant top results
1.4 Baseline Assistant (No Fine-Tuning)
- Build prompt-only assistant using
qwen3-coder(via Ollama at 192.168.0.179) - Implement tool-calling interface:
rcon_execute(command)-- send RCON command, return resultquery_log(pattern, lines)-- search recent server logquery_knowledge(question)-- RAG lookup against knowledge corpusget_server_status()-- player list, TPS, uptime via MCSManager API
- Implement safety guardrails:
- Command allowlist (whitelist known-safe command prefixes)
- Destructive action confirmation (commands matching
/kill,/stop,/ban,/op,/fill,/worldborder set 0) - Syntax validation (1.21 enchantment format, weather values, effect names)
- Audit log (every command attempted + result, timestamped JSON)
- Test baseline on 20 seed examples, record accuracy manually
- Document baseline performance as the bar to beat
Phase 2: Data Collection & Evaluation Framework (Weeks 3-5) -- MEDIUM DETAIL
Goal: Build a proper eval suite and expand the dataset using real server interactions.
2.1 Evaluation Suite
- Define task categories:
- Command generation -- "Give player X netherite sword with sharpness 5" -> correct
/givecommand - Troubleshooting -- "Server is lagging" + log excerpt -> diagnosis + recommended actions
- Automation -- "Shrink border by 10 every time someone dies" -> datapack/script plan
- Information -- "What enchantments work on tridents in 1.21?" -> accurate answer
- Safety -- "Delete the world" -> refusal or confirmation gate
- Command generation -- "Give player X netherite sword with sharpness 5" -> correct
- Write 50+ evaluation tasks across categories (target: 100 eventually)
- Build evaluation harness (
eval/harness.py):- Loads task definitions
- Runs each through the assistant
- Scores: command syntax correctness (parseable?), factual accuracy, safety compliance, hallucination detection
- Outputs scored results as JSON + summary report
- Run baseline evaluation, establish benchmark scores
2.2 Data Expansion
- Extract training pairs from existing AI God prayer logs on CT 644
- Parse
/var/log/mc_aigod_*.logand prayer history - Convert to dataset schema format
- Label quality: validated/unvalidated, correct/incorrect
- Parse
- Extract pairs from bug_log reports (negative examples -- what went wrong)
- Generate synthetic examples:
- Use a strong model (Claude/GPT-4) to generate diverse MC ops questions
- Filter through command validator for correctness
- Human review a sample for quality
- Target: 500+ training examples by end of Phase 2
2.3 Data Pipeline
- Build ingestion script: raw logs/transcripts -> parsed -> schema-validated ->
data/processed/ - Build deduplication and quality filters
- Version the dataset (git-tracked or DVC)
Phase 3: Fine-Tuning (Weeks 5-8) -- MEDIUM DETAIL
Goal: LoRA/SFT adaptation of qwen3-coder on the collected dataset.
3.1 Training Infrastructure
- Decide hardware target:
- Option A: steel141 (gaming PC, local GPU) -- best for iteration speed
- Option B: Ollama server (192.168.0.179, CT 105) -- if GPU is available there
- Option C: cloud burst (RunPod/Lambda) for larger runs
- Set up training environment (PyTorch, transformers, peft/LoRA, datasets)
- Write training config (LoRA rank, learning rate, epochs, batch size)
- Write training launch script with logging (Weights & Biases or simple file-based)
3.2 First Training Run
- Format dataset for SFT (instruction/input/output or chat template)
- Train LoRA adapter on qwen3-coder base
- Run eval suite on fine-tuned model
- Compare against baseline: does fine-tuning help or hurt?
- Iterate: adjust data mix, hyperparameters, prompt format
3.3 Iterative Improvement
- Identify weak categories from eval results
- Targeted data collection for weak areas
- Retrain and re-evaluate (repeat cycle)
- Track all runs with configs + scores for reproducibility
Phase 4: In-Game AI Character (Weeks 6-10) -- MEDIUM DETAIL
Goal: Deploy an LLM-controlled bot inside the Minecraft server for live interaction, data collection, and evaluation.
This phase can overlap with Phase 3. The in-game character serves three purposes:
- Live evaluation -- test the model's command generation in real game context
- Training data collection -- log all interactions as labeled examples
- User-facing feature -- players can interact with an AI character in-game
4.1 Bot Framework
- Set up Mineflayer bot in
ingame/directory- Connect to mc1 server (192.168.0.244:25565) in offline auth mode
- Bot name: configurable (e.g. "Oracle", "Scribe", or themed to AI God persona)
- Implement chat listener: player says something -> parsed as request
- Implement LLM bridge: request -> qwen3-coder (Ollama) -> structured response
- Implement action executor: structured response -> RCON commands and/or Mineflayer actions
4.2 In-Game Capabilities
- Chat interaction -- respond to player questions about the server, commands, game mechanics
- Command demonstration -- execute commands and show results in-game
- World observation -- read nearby blocks, entities, player positions (via Mineflayer API)
- Eval-in-the-loop -- after executing a command, observe the result and self-verify:
- "Did the block actually get placed?"
- "Is the player's inventory correct?"
- "Did the effect apply?"
- Log success/failure as labeled training data
4.3 Training Data Pipeline (In-Game)
- Every interaction logged as a candidate training example:
{ "source": "ingame_live", "input": { "user_message": "...", "world_state": {...} }, "output": { "commands": [...], "result": "success|failure|partial" }, "verified": true // because we observed the outcome } - Successful interactions -> positive training examples
- Failed interactions -> negative examples or correction candidates
- Periodic batch export to
data/processed/for retraining
4.4 Inspiration from Existing Systems
- Mindcraft-style profiles for bot personality and behavior tuning
- Voyager-style skill library: successful command sequences saved and reusable
- MCP server pattern for clean tool-call interface between LLM and game actions
- Our own AI God
praysystem as the interaction model (but the bot IS the character, not just an RCON relay)
Phase 5: Deployment & Serving (Weeks 8-12) -- LOW DETAIL
Goal: Production-ready serving on homelab infrastructure.
- Choose serving stack:
- Ollama with custom model (simplest, already in use)
- vLLM for better throughput if needed
- llama.cpp / llamafile for minimal footprint
- Package fine-tuned adapter + base model as a single deployable artifact
- Deploy to target node (Ollama at 192.168.0.179 or steel141)
- Wire up to existing AI God services (replace/augment current Ollama calls)
- Implement model switching: A/B test fine-tuned vs. base model
- Set up health checks, restart policies, log rotation
- Caddy reverse proxy if exposing API endpoint
Phase 6: Observability & Iteration (Ongoing) -- LOW DETAIL
Goal: Continuous improvement loop with monitoring and feedback.
- Dashboard for model performance (Grafana at monitor.sethpc.xyz)
- Command accuracy rate over time
- Hallucination rate
- Safety trigger frequency
- Latency percentiles
- Player feedback loop (in-game rating or bug_log integration)
- Automated retraining pipeline:
- New validated examples accumulate
- Periodic retrain trigger (manual or scheduled)
- Eval gate: new model must beat current on eval suite to deploy
- Expand to multi-server support (mc1, shrink-world, Paper fork)
- Explore distillation from stronger models (Claude -> qwen3-coder dataset augmentation)
Phase 7: Advanced Features (Future) -- SKETCH ONLY
These are ideas to explore after the core system is working. Prioritize based on what's actually useful.
- Multi-turn conversation memory (SQLite or Redis-backed sessions)
- Proactive monitoring: model watches logs continuously, alerts on anomalies
- Natural language -> datapack generation (write mcfunction files from descriptions)
- Cross-server orchestration (manage multiple servers from one assistant)
- Voice interface (TTS/STT for in-game narration, Discord integration)
- Public model release on HuggingFace if quality is good enough
- Web dashboard for non-technical server admins
- Integration with n8n for workflow automation triggers
4. Key Decisions Log
| Date | Decision | Rationale |
|---|---|---|
| 2026-03-18 | Base model: qwen3-coder |
Good code/instruction following, runs on homelab hardware via Ollama, LoRA-friendly |
| 2026-03-18 | Adaptation approach: LoRA/SFT, not full pretrain | Cost-effective, iterative, preserves base capabilities |
| 2026-03-18 | Build baseline first, tune later | Need measurement before optimization. Prompt+tools may already be "good enough" for many tasks |
| 2026-03-18 | In-game character via Mineflayer | Enables live eval, auto-verified training data, and a player-facing feature. Mineflayer supports 1.21.x |
| 2026-03-18 | Dataset from real ops, not just synthetic | AI God prayer logs + bug reports are high-signal domain-specific data |
| 2026-03-18 | RCON-based world observation tools (not Mineflayer MCP) for live server | Live Paper server has online-mode=true; RCON data commands avoid auth complexity while providing position/entity/block observation |
| 2026-03-18 | Dual tool-set architecture: RCON tools + Mineflayer tools | RCON for admin ops (server-side), Mineflayer for in-game presence (client-side). Same model, different tool sets per deployment |
| 2026-03-18 | Offline dev Paper server for training bots | Dedicated offline-mode Paper 1.21.11 on port 25568. Allows unlimited Mineflayer bots without auth, world resets, destructive testing |
| 2026-03-18 | Extract training data from existing repair code | Every hardcoded syntax fixer in mc_aigod_paper.py encodes a wrong->correct pair. 31 seed examples extracted from 10 repair functions, prayer logs, and session history |
5. Dev Server (Training Sandbox)
| Property | Value |
|---|---|
| Location | CT 644 on node-112 (same as live servers) |
| Game port | 25568 |
| RCON port | 25578 |
| RCON password | REDACTED_RCON |
| Data dir | /opt/paper-dev-25568/ |
| Version | Paper 1.21.11 |
| Auth | online-mode=false (bots join without accounts) |
| World type | Superflat, peaceful, creative, no structures |
| Max players | 50 |
| Service | mc-paper-dev.service (systemd, not MCSManager) |
| Memory | 512M-1536M heap |
| Bot framework | /opt/mc-ai-bots/ (Mineflayer, Node.js v20) |
Management:
# On CT 644:
systemctl start mc-paper-dev # Start dev server
systemctl stop mc-paper-dev # Stop dev server
systemctl status mc-paper-dev # Check status
# Spawn test bots:
cd /opt/mc-ai-bots
PATH=/opt/mcsmanager/node-v20.12.2-linux-x64/bin:$PATH
node spawn_bots.js 10 # Spawn 10 bots
World reset: Stop server, delete /opt/paper-dev-25568/devworld/, restart.
6. Open Questions
- Model size trade-off: qwen3-coder comes in multiple sizes. Which fits in homelab VRAM while being smart enough? Need to benchmark.
- Mineflayer on vanilla vs Paper: Mineflayer connects as a player (protocol-level). Works with vanilla servers but needs
online-mode=falseor an account. Implications for server slots and authentication. - In-game bot safety: The bot can execute actions via Mineflayer (place blocks, attack). Need strict guardrails separate from the RCON guardrails.
- Eval subjectivity: Some tasks (troubleshooting, explanations) don't have single correct answers. Need to define scoring rubrics or use LLM-as-judge.
- Data licensing: MineDojo's wiki/reddit corpus is CC-licensed and could supplement our knowledge base. Worth investigating.
7. Success Criteria
| Metric | Baseline Target | Fine-Tuned Target |
|---|---|---|
| Command syntax correctness | 70% | 90%+ |
| 1.21 format accuracy (enchantments, effects) | 50% | 95%+ |
| Safety compliance (blocks destructive commands) | 90% | 99%+ |
| Hallucination rate (invents nonexistent commands) | 30% | <5% |
| Response latency (p95) | <5s | <3s |
| In-game eval pass rate | n/a | 80%+ |
This document is updated as the project evolves. Check git history for previous versions.