Files

T

Seth 77efac0283 Add knowledge corpus: 14 command references, server context, and TF-IDF search index (Phase 1.3)

- knowledge/mc-commands/commands.json: 14 MC commands with JE syntax, args, examples, common errors, 1.21 version notes
- knowledge/server-context/servers.json: all 4 servers (mc1, shrink, paper-ai, paper-dev) with full config
- knowledge/build_index.py: TF-IDF indexer + search function (19 docs, 725 terms)
- All command syntax validated live on dev server via RCON (12/13 passed)
- PLAN.md: mark Phase 1.3 complete

2026-03-18 02:01:12 -04:00

20 KiB

Raw Blame History

PLAN.md -- Project Roadmap (Live Document)

Last updated: 2026-03-18 (rev 2) Status legend: [ ] planned | [~] in progress | [x] done | [-] cancelled/deferred

0. Vision

Build a lightweight, Minecraft-focused AI assistant by adapting qwen3-coder (LoRA/SFT). The assistant operates as an ops copilot for Sethpc Minecraft servers -- generating correct commands, troubleshooting logs, automating admin tasks, and optionally acting as an in-game AI character for live interaction, training data collection, and evaluation.

This is not a gameplay agent (like Voyager/MineDojo). It is a server operations assistant with an optional embodied presence for testing and data gathering.

1. Prior Art & Inspirations

These projects informed the plan but solve different problems:

Project	What it does	What we borrow
Voyager (6.7k stars)	LLM-powered embodied agent that plays Minecraft via Mineflayer. Skill library + auto-curriculum + iterative prompting.	Skill library concept (reusable verified command sequences). Iterative self-verification loop for command correctness.
MineDojo (2.2k stars)	RL/LLM research framework with 3142 tasks. Internet-scale knowledge base (730K YouTube vids, 7K wiki pages, 340K Reddit posts).	Knowledge corpus pipeline -- scraping wiki.vg and Minecraft Wiki for command syntax reference data. Task-based evaluation structure.
Mindcraft (4.9k stars)	LLM + Mineflayer in-game bots with profiles, multi-agent collab. Supports Ollama, many APIs.	Profile-based bot architecture. In-game chat integration pattern. Ollama local model support. Provides own fine-tuned models (`sweaterdog/andy-4`).
minecraft-mcp-server (514 stars)	MCP (Model Context Protocol) server wrapping Mineflayer. Lets Claude/LLMs control a Minecraft character via tool calls.	MCP tool-call interface for in-game actions. Could be adapted for our eval harness.
Mineflayer (6.7k stars)	Node.js Minecraft bot framework. Supports 1.8-1.21.11. Movement, inventory, chat, block interaction.	Primary framework for in-game AI character. Mature, well-maintained, 1.21 support confirmed.
Existing AI God system (our own)	Log-tail + RCON + Ollama pipeline. `pray` trigger, divine intervention, command validation, syntax repair. Vanilla + Paper fork.	Direct predecessor. Baseline to measure against. Source of real training data (prayer logs, bug reports).

2. Architecture Overview

                    +---------------------+
                    |   Minecraft Server   |
                    |  (CT 644, 1.21.x)   |
                    +----+----------+-----+
                         |          |
                    RCON |          | Protocol (Mineflayer)
                         |          |
               +---------+--+  +---+------------+
               | Ops Layer   |  | In-Game Agent  |
               | (existing   |  | (Mineflayer    |
               |  log-tail + |  |  bot, optional)|
               |  RCON cmds) |  +---+------------+
               +---------+--+      |
                         |         |
                    +----+---------+----+
                    |  Assistant Core   |
                    |  (qwen3-coder     |
                    |   + LoRA adapter) |
                    +----+----+---------+
                         |    |
                +--------+    +--------+
                |                      |
          +-----+------+    +---------+--------+
          | Tool Layer  |    | Knowledge/RAG    |
          | - RCON exec |    | - MC Wiki index  |
          | - Log query |    | - Command syntax |
          | - MCSManager|    | - Server context |
          |   API       |    | - Prior sessions |
          +-------------+    +------------------+

3. Phased Roadmap

Phase 1: Foundation (Weeks 1-3) -- HIGH DETAIL

Goal: Repo setup, baseline tooling, dataset schema, knowledge corpus.

1.1 Project Setup

Define project idea and constraints (IDEA.md)
Confirm no prior art exists for this specific niche
Create PLAN.md (this document)
Create Gitea repo and configure remote

Set up directory structure:

Mincecraft-AI-model/
├── PLAN.md
├── IDEA.md
├── SESSION.md             # local only (gitignored)
├── SESSION.default.md     # template reference (tracked)
├── .gitignore
├── data/
│   ├── raw/               # scraped wiki, logs, transcripts
│   ├── processed/         # cleaned, formatted training pairs
│   │   └── seed_dataset.jsonl  # 31 seed examples
│   ├── schema.json        # dataset JSON Schema
│   └── validate_dataset.py
├── knowledge/
│   ├── mc-commands/       # 1.21 command syntax reference
│   ├── server-context/    # server.properties, datapacks, infra
│   └── wiki-chunks/       # chunked wiki content for RAG
├── eval/
│   ├── tasks/             # evaluation task definitions
│   └── results/           # scored outputs (gitignored)
├── training/
│   ├── configs/           # LoRA/SFT training configs
│   ├── scripts/           # training launch scripts
│   └── checkpoints/       # saved adapters (gitignored)
├── agent/
│   ├── tools/             # RCON, log query, MCSManager tools
│   ├── guardrails/        # command allowlist, safety policies
│   └── prompts/           # system prompts, few-shot templates
└── ingame/                # in-game bots (Mineflayer)
    ├── package.json
    ├── test_connect.js    # single bot connection test
    ├── spawn_bots.js      # multi-bot spawner (passive)
    └── aware_bots.js      # event-aware bots (training data)

Add .gitignore (checkpoints, secrets, pycache, node_modules)
Initial commit and push

1.2 Dataset Schema

Define the training example format (data/schema.json) -- includes negative_output for wrong->correct pairs
Write a JSON Schema validator script (data/validate_dataset.py)
Seed 31 examples from repair code, prayer logs, sudo logs, and session history (data/processed/seed_dataset.jsonl)

1.3 Knowledge Corpus

Scrape Minecraft Wiki command reference pages for 1.21.x syntax (14 commands in knowledge/mc-commands/commands.json)
- Includes JE syntax, arguments, examples, version notes, and common errors per command
- Commands validated live on dev server (Paper 1.21.11) -- 12/13 passed, 1 false negative (already in target state)
Extract and chunk local server context (knowledge/server-context/servers.json)
- All 4 servers (mc1, shrink-world, paper-ai, paper-dev) with ports, RCON, settings, plugins
- Player list with UUIDs, infrastructure details, version-specific notes
Index knowledge corpus for RAG retrieval (knowledge/build_index.py -- TF-IDF with title boosting)
- 19 documents indexed, 725 unique terms
Validated with 6 test queries -- all return relevant top results

1.4 Baseline Assistant (No Fine-Tuning)

Build prompt-only assistant using qwen3-coder (via Ollama at 192.168.0.179)
Implement tool-calling interface:
- rcon_execute(command) -- send RCON command, return result
- query_log(pattern, lines) -- search recent server log
- query_knowledge(question) -- RAG lookup against knowledge corpus
- get_server_status() -- player list, TPS, uptime via MCSManager API
Implement safety guardrails:
- Command allowlist (whitelist known-safe command prefixes)
- Destructive action confirmation (commands matching /kill, /stop, /ban, /op, /fill, /worldborder set 0)
- Syntax validation (1.21 enchantment format, weather values, effect names)
- Audit log (every command attempted + result, timestamped JSON)
Test baseline on 20 seed examples, record accuracy manually
Document baseline performance as the bar to beat

Phase 2: Data Collection & Evaluation Framework (Weeks 3-5) -- MEDIUM DETAIL

Goal: Build a proper eval suite and expand the dataset using real server interactions.

2.1 Evaluation Suite

Define task categories:
- Command generation -- "Give player X netherite sword with sharpness 5" -> correct /give command
- Troubleshooting -- "Server is lagging" + log excerpt -> diagnosis + recommended actions
- Automation -- "Shrink border by 10 every time someone dies" -> datapack/script plan
- Information -- "What enchantments work on tridents in 1.21?" -> accurate answer
- Safety -- "Delete the world" -> refusal or confirmation gate
Write 50+ evaluation tasks across categories (target: 100 eventually)
Build evaluation harness (eval/harness.py):
- Loads task definitions
- Runs each through the assistant
- Scores: command syntax correctness (parseable?), factual accuracy, safety compliance, hallucination detection
- Outputs scored results as JSON + summary report
Run baseline evaluation, establish benchmark scores

2.2 Data Expansion

Extract training pairs from existing AI God prayer logs on CT 644
- Parse /var/log/mc_aigod_*.log and prayer history
- Convert to dataset schema format
- Label quality: validated/unvalidated, correct/incorrect
Extract pairs from bug_log reports (negative examples -- what went wrong)
Generate synthetic examples:
- Use a strong model (Claude/GPT-4) to generate diverse MC ops questions
- Filter through command validator for correctness
- Human review a sample for quality
Target: 500+ training examples by end of Phase 2

2.3 Data Pipeline

Build ingestion script: raw logs/transcripts -> parsed -> schema-validated -> data/processed/
Build deduplication and quality filters
Version the dataset (git-tracked or DVC)

Phase 3: Fine-Tuning (Weeks 5-8) -- MEDIUM DETAIL

Goal: LoRA/SFT adaptation of qwen3-coder on the collected dataset.

3.1 Training Infrastructure

Decide hardware target:
- Option A: steel141 (gaming PC, local GPU) -- best for iteration speed
- Option B: Ollama server (192.168.0.179, CT 105) -- if GPU is available there
- Option C: cloud burst (RunPod/Lambda) for larger runs
Set up training environment (PyTorch, transformers, peft/LoRA, datasets)
Write training config (LoRA rank, learning rate, epochs, batch size)
Write training launch script with logging (Weights & Biases or simple file-based)

3.2 First Training Run

Format dataset for SFT (instruction/input/output or chat template)
Train LoRA adapter on qwen3-coder base
Run eval suite on fine-tuned model
Compare against baseline: does fine-tuning help or hurt?
Iterate: adjust data mix, hyperparameters, prompt format

3.3 Iterative Improvement

Identify weak categories from eval results
Targeted data collection for weak areas
Retrain and re-evaluate (repeat cycle)
Track all runs with configs + scores for reproducibility

Phase 4: In-Game AI Character (Weeks 6-10) -- MEDIUM DETAIL

Goal: Deploy an LLM-controlled bot inside the Minecraft server for live interaction, data collection, and evaluation.

This phase can overlap with Phase 3. The in-game character serves three purposes:

Live evaluation -- test the model's command generation in real game context
Training data collection -- log all interactions as labeled examples
User-facing feature -- players can interact with an AI character in-game

4.1 Bot Framework

Set up Mineflayer bot in ingame/ directory
- Connect to mc1 server (192.168.0.244:25565) in offline auth mode
- Bot name: configurable (e.g. "Oracle", "Scribe", or themed to AI God persona)
Implement chat listener: player says something -> parsed as request
Implement LLM bridge: request -> qwen3-coder (Ollama) -> structured response
Implement action executor: structured response -> RCON commands and/or Mineflayer actions

4.2 In-Game Capabilities

Chat interaction -- respond to player questions about the server, commands, game mechanics
Command demonstration -- execute commands and show results in-game
World observation -- read nearby blocks, entities, player positions (via Mineflayer API)
Eval-in-the-loop -- after executing a command, observe the result and self-verify:
- "Did the block actually get placed?"
- "Is the player's inventory correct?"
- "Did the effect apply?"
- Log success/failure as labeled training data

4.3 Training Data Pipeline (In-Game)

Every interaction logged as a candidate training example:

{
  "source": "ingame_live",
  "input": { "user_message": "...", "world_state": {...} },
  "output": { "commands": [...], "result": "success|failure|partial" },
  "verified": true  // because we observed the outcome
}

Successful interactions -> positive training examples
Failed interactions -> negative examples or correction candidates
Periodic batch export to data/processed/ for retraining

4.4 Inspiration from Existing Systems

Mindcraft-style profiles for bot personality and behavior tuning
Voyager-style skill library: successful command sequences saved and reusable
MCP server pattern for clean tool-call interface between LLM and game actions
Our own AI God pray system as the interaction model (but the bot IS the character, not just an RCON relay)

Phase 5: Deployment & Serving (Weeks 8-12) -- LOW DETAIL

Goal: Production-ready serving on homelab infrastructure.

Choose serving stack:
- Ollama with custom model (simplest, already in use)
- vLLM for better throughput if needed
- llama.cpp / llamafile for minimal footprint
Package fine-tuned adapter + base model as a single deployable artifact
Deploy to target node (Ollama at 192.168.0.179 or steel141)
Wire up to existing AI God services (replace/augment current Ollama calls)
Implement model switching: A/B test fine-tuned vs. base model
Set up health checks, restart policies, log rotation
Caddy reverse proxy if exposing API endpoint

Phase 6: Observability & Iteration (Ongoing) -- LOW DETAIL

Goal: Continuous improvement loop with monitoring and feedback.

Dashboard for model performance (Grafana at monitor.sethpc.xyz)
- Command accuracy rate over time
- Hallucination rate
- Safety trigger frequency
- Latency percentiles
Player feedback loop (in-game rating or bug_log integration)
Automated retraining pipeline:
- New validated examples accumulate
- Periodic retrain trigger (manual or scheduled)
- Eval gate: new model must beat current on eval suite to deploy
Expand to multi-server support (mc1, shrink-world, Paper fork)
Explore distillation from stronger models (Claude -> qwen3-coder dataset augmentation)

Phase 7: Advanced Features (Future) -- SKETCH ONLY

These are ideas to explore after the core system is working. Prioritize based on what's actually useful.

Multi-turn conversation memory (SQLite or Redis-backed sessions)
Proactive monitoring: model watches logs continuously, alerts on anomalies
Natural language -> datapack generation (write mcfunction files from descriptions)
Cross-server orchestration (manage multiple servers from one assistant)
Voice interface (TTS/STT for in-game narration, Discord integration)
Public model release on HuggingFace if quality is good enough
Web dashboard for non-technical server admins
Integration with n8n for workflow automation triggers

4. Key Decisions Log

Date	Decision	Rationale
2026-03-18	Base model: `qwen3-coder`	Good code/instruction following, runs on homelab hardware via Ollama, LoRA-friendly
2026-03-18	Adaptation approach: LoRA/SFT, not full pretrain	Cost-effective, iterative, preserves base capabilities
2026-03-18	Build baseline first, tune later	Need measurement before optimization. Prompt+tools may already be "good enough" for many tasks
2026-03-18	In-game character via Mineflayer	Enables live eval, auto-verified training data, and a player-facing feature. Mineflayer supports 1.21.x
2026-03-18	Dataset from real ops, not just synthetic	AI God prayer logs + bug reports are high-signal domain-specific data
2026-03-18	RCON-based world observation tools (not Mineflayer MCP) for live server	Live Paper server has online-mode=true; RCON data commands avoid auth complexity while providing position/entity/block observation
2026-03-18	Dual tool-set architecture: RCON tools + Mineflayer tools	RCON for admin ops (server-side), Mineflayer for in-game presence (client-side). Same model, different tool sets per deployment
2026-03-18	Offline dev Paper server for training bots	Dedicated offline-mode Paper 1.21.11 on port 25568. Allows unlimited Mineflayer bots without auth, world resets, destructive testing
2026-03-18	Extract training data from existing repair code	Every hardcoded syntax fixer in mc_aigod_paper.py encodes a wrong->correct pair. 31 seed examples extracted from 10 repair functions, prayer logs, and session history

5. Dev Server (Training Sandbox)

Property	Value
Location	CT 644 on node-112 (same as live servers)
Game port	`25568`
RCON port	`25578`
RCON password	`REDACTED_RCON`
Data dir	`/opt/paper-dev-25568/`
Version	Paper 1.21.11
Auth	`online-mode=false` (bots join without accounts)
World type	Superflat, peaceful, creative, no structures
Max players	50
Service	`mc-paper-dev.service` (systemd, not MCSManager)
Memory	512M-1536M heap
Bot framework	`/opt/mc-ai-bots/` (Mineflayer, Node.js v20)

Management:

# On CT 644:
systemctl start mc-paper-dev    # Start dev server
systemctl stop mc-paper-dev     # Stop dev server
systemctl status mc-paper-dev   # Check status

# Spawn test bots:
cd /opt/mc-ai-bots
PATH=/opt/mcsmanager/node-v20.12.2-linux-x64/bin:$PATH
node spawn_bots.js 10           # Spawn 10 bots

World reset: Stop server, delete /opt/paper-dev-25568/devworld/, restart.

6. Open Questions

Model size trade-off: qwen3-coder comes in multiple sizes. Which fits in homelab VRAM while being smart enough? Need to benchmark.
Mineflayer on vanilla vs Paper: Mineflayer connects as a player (protocol-level). Works with vanilla servers but needs online-mode=false or an account. Implications for server slots and authentication.
In-game bot safety: The bot can execute actions via Mineflayer (place blocks, attack). Need strict guardrails separate from the RCON guardrails.
Eval subjectivity: Some tasks (troubleshooting, explanations) don't have single correct answers. Need to define scoring rubrics or use LLM-as-judge.
Data licensing: MineDojo's wiki/reddit corpus is CC-licensed and could supplement our knowledge base. Worth investigating.

7. Success Criteria

Metric	Baseline Target	Fine-Tuned Target
Command syntax correctness	70%	90%+
1.21 format accuracy (enchantments, effects)	50%	95%+
Safety compliance (blocks destructive commands)	90%	99%+
Hallucination rate (invents nonexistent commands)	30%	<5%
Response latency (p95)	<5s	<3s
In-game eval pass rate	n/a	80%+

This document is updated as the project evolves. Check git history for previous versions.

20 KiB Raw Blame History