Files

T

Seth 78031d16c0 Risk gradient (0-5), updated system prompts, 233 examples

Risk gradient system:
- All 233 training examples tagged with risk_level (0-5)
- 0=blocked(15), 1=refuse(9), 2=warn(17), 3=normal(169), 4=generous(23)
- Schema updated with risk_level and scoring_mode fields
- Eval harness uses risk_level for safety scoring

System prompts rewritten:
- Shared syntax rules and risk gradient reference across all modes
- Sudo: permission level 4, do what admin asks, only refuse level 0-1
- God: permission level 2-4 (mood-dependent), character-driven decisions
- God_system: permission level 3, 80% benevolent / 15% mischievous / 5% wrathful

Data:
- 20 new live playtest examples from training audit log (233 total)
- 43 wrong→right pairs (17 from validator repairs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-18 16:14:54 -04:00

24 KiB

Raw Blame History

PLAN.md -- Project Roadmap (Live Document)

Last updated: 2026-03-18 (rev 2) Status legend: [ ] planned | [~] in progress | [x] done | [-] cancelled/deferred

0. Vision

Build a lightweight, Minecraft-focused AI assistant by adapting qwen3-coder (LoRA/SFT). The assistant operates as an ops copilot for Sethpc Minecraft servers -- generating correct commands, troubleshooting logs, automating admin tasks, and optionally acting as an in-game AI character for live interaction, training data collection, and evaluation.

This is not a gameplay agent (like Voyager/MineDojo). It is a server operations assistant with an optional embodied presence for testing and data gathering.

1. Prior Art & Inspirations

These projects informed the plan but solve different problems:

Project	What it does	What we borrow
Voyager (6.7k stars)	LLM-powered embodied agent that plays Minecraft via Mineflayer. Skill library + auto-curriculum + iterative prompting.	Skill library concept (reusable verified command sequences). Iterative self-verification loop for command correctness.
MineDojo (2.2k stars)	RL/LLM research framework with 3142 tasks. Internet-scale knowledge base (730K YouTube vids, 7K wiki pages, 340K Reddit posts).	Knowledge corpus pipeline -- scraping wiki.vg and Minecraft Wiki for command syntax reference data. Task-based evaluation structure.
Mindcraft (4.9k stars)	LLM + Mineflayer in-game bots with profiles, multi-agent collab. Supports Ollama, many APIs.	Profile-based bot architecture. In-game chat integration pattern. Ollama local model support. Provides own fine-tuned models (`sweaterdog/andy-4`).
minecraft-mcp-server (514 stars)	MCP (Model Context Protocol) server wrapping Mineflayer. Lets Claude/LLMs control a Minecraft character via tool calls.	MCP tool-call interface for in-game actions. Could be adapted for our eval harness.
Mineflayer (6.7k stars)	Node.js Minecraft bot framework. Supports 1.8-1.21.11. Movement, inventory, chat, block interaction.	Primary framework for in-game AI character. Mature, well-maintained, 1.21 support confirmed.
Existing AI God system (our own)	Log-tail + RCON + Ollama pipeline. `pray` trigger, divine intervention, command validation, syntax repair. Vanilla + Paper fork.	Direct predecessor. Baseline to measure against. Source of real training data (prayer logs, bug reports).

2. Architecture Overview

                    +---------------------+
                    |   Minecraft Server   |
                    |  (CT 644, 1.21.x)   |
                    +----+----------+-----+
                         |          |
                    RCON |          | Protocol (Mineflayer)
                         |          |
               +---------+--+  +---+------------+
               | Ops Layer   |  | In-Game Agent  |
               | (existing   |  | (Mineflayer    |
               |  log-tail + |  |  bot, optional)|
               |  RCON cmds) |  +---+------------+
               +---------+--+      |
                         |         |
                    +----+---------+----+
                    |  Assistant Core   |
                    |  (qwen3-coder     |
                    |   + LoRA adapter) |
                    +----+----+---------+
                         |    |
                +--------+    +--------+
                |                      |
          +-----+------+    +---------+--------+
          | Tool Layer  |    | Knowledge/RAG    |
          | - RCON exec |    | - MC Wiki index  |
          | - Log query |    | - Command syntax |
          | - MCSManager|    | - Server context |
          |   API       |    | - Prior sessions |
          +-------------+    +------------------+

3. Phased Roadmap

Phase 1: Foundation (Weeks 1-3) -- HIGH DETAIL

Goal: Repo setup, baseline tooling, dataset schema, knowledge corpus.

1.1 Project Setup

Define project idea and constraints (IDEA.md)
Confirm no prior art exists for this specific niche
Create PLAN.md (this document)
Create Gitea repo and configure remote

Set up directory structure:

Mincecraft-AI-model/
├── PLAN.md
├── IDEA.md
├── SESSION.md             # local only (gitignored)
├── SESSION.default.md     # template reference (tracked)
├── .gitignore
├── data/
│   ├── raw/               # scraped wiki, logs, transcripts
│   ├── processed/         # cleaned, formatted training pairs
│   │   └── seed_dataset.jsonl  # 31 seed examples
│   ├── schema.json        # dataset JSON Schema
│   └── validate_dataset.py
├── knowledge/
│   ├── mc-commands/       # 1.21 command syntax reference
│   ├── server-context/    # server.properties, datapacks, infra
│   └── wiki-chunks/       # chunked wiki content for RAG
├── eval/
│   ├── tasks/             # evaluation task definitions
│   └── results/           # scored outputs (gitignored)
├── training/
│   ├── configs/           # LoRA/SFT training configs
│   ├── scripts/           # training launch scripts
│   └── checkpoints/       # saved adapters (gitignored)
├── agent/
│   ├── tools/             # RCON, log query, MCSManager tools
│   ├── guardrails/        # command allowlist, safety policies
│   └── prompts/           # system prompts, few-shot templates
└── ingame/                # in-game bots (Mineflayer)
    ├── package.json
    ├── test_connect.js    # single bot connection test
    ├── spawn_bots.js      # multi-bot spawner (passive)
    └── aware_bots.js      # event-aware bots (training data)

Add .gitignore (checkpoints, secrets, pycache, node_modules)
Initial commit and push

1.2 Dataset Schema

Define the training example format (data/schema.json) -- includes negative_output for wrong->correct pairs
Write a JSON Schema validator script (data/validate_dataset.py)
Seed 31 examples from repair code, prayer logs, sudo logs, and session history (data/processed/seed_dataset.jsonl)

1.3 Knowledge Corpus

Scrape Minecraft Wiki command reference pages for 1.21.x syntax (14 commands in knowledge/mc-commands/commands.json)
- Includes JE syntax, arguments, examples, version notes, and common errors per command
- Commands validated live on dev server (Paper 1.21.11) -- 12/13 passed, 1 false negative (already in target state)
Extract and chunk local server context (knowledge/server-context/servers.json)
- All 4 servers (mc1, shrink-world, paper-ai, paper-dev) with ports, RCON, settings, plugins
- Player list with UUIDs, infrastructure details, version-specific notes
Index knowledge corpus for RAG retrieval (knowledge/build_index.py -- TF-IDF with title boosting)
- 19 documents indexed, 725 unique terms
Validated with 6 test queries -- all return relevant top results

1.4 Baseline Assistant (No Fine-Tuning)

Build prompt-only assistant (agent/serve.py) with Ollama integration
- Interactive CLI, single-query, and dataset evaluation modes
- Configurable model, RCON, Ollama URL via JSON config or CLI args
Implement tool-calling interface:
- agent/tools/rcon_tool.py -- RCON execute, get_server_status, get_player_info
- agent/tools/knowledge_tool.py -- RAG search, command reference lookup, server context
Implement safety guardrails (agent/guardrails/command_filter.py):
- Command allowlist (14 safe prefixes, blocks /stop /op /ban etc.)
- Execute-tail bypass detection (blocks unsafe commands inside execute chains)
- Destructive action detection (kill @a, fill air, worldborder 0, TNT, fire)
- 1.21 syntax validation warnings (old NBT, bare effect, weather storm, gamemode abbrevs)
- Audit log (every query + commands + results to data/raw/audit_log.jsonl)
- All guardrails validated: 10/10 allowlist, 5/6 syntax warnings
System prompts for sudo, god, and intervention modes (agent/prompts/system_prompts.py)
Run baseline evaluation on seed dataset, record accuracy
Document baseline performance as the bar to beat

Phase 2: Data Collection & Evaluation Framework (Weeks 3-5) -- MEDIUM DETAIL

Goal: Build a proper eval suite and expand the dataset using real server interactions.

2.1 Evaluation Suite

Define task categories:
- Command generation (50 examples) -- "Give player X netherite sword with sharpness 5" -> correct /give command
- Troubleshooting (6 examples) -- "Server is lagging" -> diagnosis + recommended actions
- Information (6 examples) -- "What enchantments work on tridents in 1.21?" -> accurate answer
- Safety (10 examples) -- "Delete the world" -> refusal, social engineering, indirect destruction, privilege escalation
- Negative (4 examples) -- Known failure modes (JSON escaping, hallucination)
- Automation -- deferred (need datapack examples)
Write 182 evaluation tasks across categories (target was 100; exceeded)
- Phase 1 seed: 31 examples (repair patterns, prayer logs, session history)
- Phase 2 manual: 45 examples (troubleshooting, edge cases, ambiguity, safety, info)
- Phase 2 log extraction: 106 examples (58 sudo, 34 prayer, 14 bug reports from CT 644 logs)
Build evaluation harness (eval/harness.py):
- Per-category breakdowns, baseline comparison with deltas
- Hallucination detection, empty response tracking, gratuitous action detection
- Failure detail reporting for targeted improvement
- --save-baseline / --baseline for tracking improvement over time
Build live bake-off harness (eval/live_bakeoff.py):
- Executes commands via RCON on real server, measures rcon_success rate
- Side-by-side model comparison with RCON disagreement analysis
Run baseline evaluation, establish benchmark scores:
- gemma3n:e4b baseline: 59.2% cmd match, 82.9% syntax, 93.4% safety
- qwen3:8b comparison: 73.7% cmd match, 82.9% syntax, 92.1% safety
- Key gaps: troubleshooting (16-33%), info queries (0-67%), safety (40-50%)

2.2 Data Expansion

Extract training pairs from existing AI God prayer logs on CT 644
- Parsed paper + shrink service logs, prayer memories, bug logs
- 106 examples extracted (58 sudo, 34 prayer, 14 bug reports)
- All tagged validated=false, needs human review
Extract pairs from bug_log reports (negative examples -- what went wrong)
- 14 negative examples from bug reports showing model failures
- Common failures: invalid item IDs, old NBT syntax, fall damage from TP, suffocation
Generate synthetic examples:
- Use a strong model (Claude/GPT-4) to generate diverse MC ops questions
- Filter through command validator for correctness
- Human review a sample for quality
Target: 500+ training examples by end of Phase 2 (currently 182)

2.3 Data Pipeline

Structured training audit log added to mc_aigod_paper.py
- Every pray/sudo interaction writes JSONL to /var/log/mc_training_audit.jsonl
- Captures: player, mode, commands_generated, commands_executed, rcon_results, server context
- Auto-infers category (command_gen, info, safety, troubleshoot)
- All entries tagged needs_review=true
Enhanced bug_log → training feedback pipeline
- bug_log entries now write structured feedback to training audit
- Links to player's last sudo/prayer interaction
- Trust level tagging: admin="verified", playtesters="unverified"
- Non-admin feedback gets reviewer_notes warning about possible wrong expectations
Playtest infrastructure
- All servers switched to online-mode=false + whitelist (slingshooter08 whitelisted)
- sudo_allow_all_players config flag added (enabled for paper-ai)
- Reddit post draft + Google Form application created
- Training servers: paper-ai (primary, human playtesters) + paper-dev (bots, destructive testing)
Build ingestion script: raw logs/transcripts -> parsed -> schema-validated -> data/processed/
Build deduplication and quality filters
Version the dataset (git-tracked or DVC)

Phase 3: Fine-Tuning (Weeks 5-8) -- MEDIUM DETAIL

Goal: LoRA/SFT adaptation of qwen3-coder on the collected dataset.

3.1 Training Infrastructure

Decide hardware target:
- Option A: steel141 (gaming PC, local GPU) -- best for iteration speed
- Option B: Ollama server (192.168.0.179, CT 105) -- if GPU is available there
- Option C: cloud burst (RunPod/Lambda) for larger runs
Set up training environment (PyTorch, transformers, peft/LoRA, datasets)
Write training config (LoRA rank, learning rate, epochs, batch size)
Write training launch script with logging (Weights & Biases or simple file-based)

3.2 First Training Run

Format dataset for SFT (instruction/input/output or chat template)
Train LoRA adapter on qwen3-coder base
Run eval suite on fine-tuned model
Compare against baseline: does fine-tuning help or hurt?
Iterate: adjust data mix, hyperparameters, prompt format

3.3 Iterative Improvement

Identify weak categories from eval results
Targeted data collection for weak areas
Retrain and re-evaluate (repeat cycle)
Track all runs with configs + scores for reproducibility

Phase 4: In-Game AI Character (Weeks 6-10) -- MEDIUM DETAIL

Goal: Deploy an LLM-controlled bot inside the Minecraft server for live interaction, data collection, and evaluation.

This phase can overlap with Phase 3. The in-game character serves three purposes:

Live evaluation -- test the model's command generation in real game context
Training data collection -- log all interactions as labeled examples
User-facing feature -- players can interact with an AI character in-game

4.1 Bot Framework

Set up Mineflayer bot in ingame/ directory
- Connect to mc1 server (192.168.0.244:25565) in offline auth mode
- Bot name: configurable (e.g. "Oracle", "Scribe", or themed to AI God persona)
Implement chat listener: player says something -> parsed as request
Implement LLM bridge: request -> qwen3-coder (Ollama) -> structured response
Implement action executor: structured response -> RCON commands and/or Mineflayer actions

4.2 In-Game Capabilities

Chat interaction -- respond to player questions about the server, commands, game mechanics
Command demonstration -- execute commands and show results in-game
World observation -- read nearby blocks, entities, player positions (via Mineflayer API)
Eval-in-the-loop -- after executing a command, observe the result and self-verify:
- "Did the block actually get placed?"
- "Is the player's inventory correct?"
- "Did the effect apply?"
- Log success/failure as labeled training data

4.3 Training Data Pipeline (In-Game)

Every interaction logged as a candidate training example:

{
  "source": "ingame_live",
  "input": { "user_message": "...", "world_state": {...} },
  "output": { "commands": [...], "result": "success|failure|partial" },
  "verified": true  // because we observed the outcome
}

Successful interactions -> positive training examples
Failed interactions -> negative examples or correction candidates
Periodic batch export to data/processed/ for retraining

4.4 Inspiration from Existing Systems

Mindcraft-style profiles for bot personality and behavior tuning
Voyager-style skill library: successful command sequences saved and reusable
MCP server pattern for clean tool-call interface between LLM and game actions
Our own AI God pray system as the interaction model (but the bot IS the character, not just an RCON relay)

Phase 5: Deployment & Serving (Weeks 8-12) -- LOW DETAIL

Goal: Production-ready serving on homelab infrastructure.

Choose serving stack:
- Ollama with custom model (simplest, already in use)
- vLLM for better throughput if needed
- llama.cpp / llamafile for minimal footprint
Package fine-tuned adapter + base model as a single deployable artifact
Deploy to target node (Ollama at 192.168.0.179 or steel141)
Wire up to existing AI God services (replace/augment current Ollama calls)
Implement model switching: A/B test fine-tuned vs. base model
Set up health checks, restart policies, log rotation
Caddy reverse proxy if exposing API endpoint

Phase 6: Observability & Iteration (Ongoing) -- LOW DETAIL

Goal: Continuous improvement loop with monitoring and feedback.

Dashboard for model performance (Grafana at monitor.sethpc.xyz)
- Command accuracy rate over time
- Hallucination rate
- Safety trigger frequency
- Latency percentiles
Player feedback loop (in-game rating or bug_log integration)
Automated retraining pipeline:
- New validated examples accumulate
- Periodic retrain trigger (manual or scheduled)
- Eval gate: new model must beat current on eval suite to deploy
Expand to multi-server support (mc1, shrink-world, Paper fork)
Explore distillation from stronger models (Claude -> qwen3-coder dataset augmentation)

Phase 7: Advanced Features (Future) -- SKETCH ONLY

These are ideas to explore after the core system is working. Prioritize based on what's actually useful.

Multi-turn conversation memory (SQLite or Redis-backed sessions)
Proactive monitoring: model watches logs continuously, alerts on anomalies
Natural language -> datapack generation (write mcfunction files from descriptions)
Cross-server orchestration (manage multiple servers from one assistant)
Voice interface (TTS/STT for in-game narration, Discord integration)
Public model release on HuggingFace if quality is good enough
Web dashboard for non-technical server admins
Integration with n8n for workflow automation triggers

4. Key Decisions Log

Date	Decision	Rationale
2026-03-18	~~Base model: `qwen3-coder`~~	~~Good code/instruction following~~ — Superseded: see below
2026-03-18	Serving model: `gemma3n:e4b` (6.9B)	Bake-off winner: 80.6% cmd match, 100% safety, 5.9s latency. Beats qwen3-coder:30b on all metrics. Deployed to RTX 4000 on node-197.
2026-03-18	Fine-tuning base: `qwen3:8b` (dense, Apache 2.0)	77.4% cmd match with token budget fix. Best syntax quality, perfect safety, strong Unsloth ecosystem. Token-budget issue = exactly what LoRA fixes.
2026-03-18	Training hardware: steel141 RTX 3090 Ti (24GB)	QLoRA on 8B model fits easily. Conda env `mc-train` with Unsloth 2026.3.5 ready.
2026-03-18	Serving hardware: node-197 RTX 4000 (8GB) via Ollama	35/36 layers GPU offload for 7B models. Always-on, no desktop contention.
2026-03-18	Adaptation approach: LoRA/SFT, not full pretrain	Cost-effective, iterative, preserves base capabilities
2026-03-18	Build baseline first, tune later	Need measurement before optimization. Prompt+tools may already be "good enough" for many tasks
2026-03-18	In-game character via Mineflayer	Enables live eval, auto-verified training data, and a player-facing feature. Mineflayer supports 1.21.x
2026-03-18	Dataset from real ops, not just synthetic	AI God prayer logs + bug reports are high-signal domain-specific data
2026-03-18	RCON-based world observation tools (not Mineflayer MCP) for live server	Live Paper server has online-mode=true; RCON data commands avoid auth complexity while providing position/entity/block observation
2026-03-18	Dual tool-set architecture: RCON tools + Mineflayer tools	RCON for admin ops (server-side), Mineflayer for in-game presence (client-side). Same model, different tool sets per deployment
2026-03-18	Offline dev Paper server for training bots	Dedicated offline-mode Paper 1.21.11 on port 25568. Allows unlimited Mineflayer bots without auth, world resets, destructive testing
2026-03-18	Extract training data from existing repair code	Every hardcoded syntax fixer in mc_aigod_paper.py encodes a wrong->correct pair. 31 seed examples extracted from 10 repair functions, prayer logs, and session history
2026-03-18	Numerical risk gradient (0-5) instead of per-mode rule sets	0=blocked (server crash/privesc), 1=refuse (mass harm), 2=warn+allow (self-destructive), 3=normal, 4=generous (admin/creative), 5=unrestricted. Each mode sets a permission threshold: sudo=4, pray=2-4 (mood shifts), god_system=3. One system, not three separate constraint models.
2026-03-18	Mode-aware eval scoring	Sudo scored strict (exact command match). Pray/god scored soft (command category match, in-character message, appropriate intensity). Exact match meaningless for pray — God's creative interpretation is a feature.
2026-03-18	God is a character, not a safety filter	Pray mode: God decides based on worthiness/character/mood. The prayer is input to God's decision, not an instruction. God acts in mysterious ways — sometimes generous, sometimes strict, occasionally wrathful. Training data reflects this with loose expected outputs.
2026-03-18	Validator improvements: 5 new syntax repair functions	@s→player, NBT→component enchants, strip invalid components, hallucinated effect/command repair. Deployed to paper-ai. Every repair is a negative→positive training pair.
2026-03-18	Eval/testing on steel141 (RTX 3090 Ti), not prod RTX 4000	All eval scripts default to 192.168.0.141:11434. Prod GPU reserved for live serving only.

5. Dev Server (Training Sandbox)

Property	Value
Location	CT 644 on node-112 (same as live servers)
Game port	`25568`
RCON port	`25578`
RCON password	`REDACTED_RCON`
Data dir	`/opt/paper-dev-25568/`
Version	Paper 1.21.11
Auth	`online-mode=false` (bots join without accounts)
World type	Superflat, peaceful, creative, no structures
Max players	50
Service	`mc-paper-dev.service` (systemd, not MCSManager)
Memory	512M-1536M heap
Bot framework	`/opt/mc-ai-bots/` (Mineflayer, Node.js v20)

Management:

# On CT 644:
systemctl start mc-paper-dev    # Start dev server
systemctl stop mc-paper-dev     # Stop dev server
systemctl status mc-paper-dev   # Check status

# Spawn test bots:
cd /opt/mc-ai-bots
PATH=/opt/mcsmanager/node-v20.12.2-linux-x64/bin:$PATH
node spawn_bots.js 10           # Spawn 10 bots

World reset: Stop server, delete /opt/paper-dev-25568/devworld/, restart.

6. Open Questions

Model size trade-off: qwen3-coder comes in multiple sizes. Which fits in homelab VRAM while being smart enough? Need to benchmark.
Mineflayer on vanilla vs Paper: Mineflayer connects as a player (protocol-level). Works with vanilla servers but needs online-mode=false or an account. Implications for server slots and authentication.
In-game bot safety: The bot can execute actions via Mineflayer (place blocks, attack). Need strict guardrails separate from the RCON guardrails.
Eval subjectivity: Some tasks (troubleshooting, explanations) don't have single correct answers. Need to define scoring rubrics or use LLM-as-judge.
Data licensing: MineDojo's wiki/reddit corpus is CC-licensed and could supplement our knowledge base. Worth investigating.

7. Success Criteria

Metric	Actual Baseline (gemma3n)	Actual Baseline (qwen3:8b)	Fine-Tuned Target
Sudo (strict scoring)
Command match (loose)	59.2%	73.7%	85%+
Exact match (strict)	10.5%	18.4%	40%+
RCON success (live)	33.1%	34.6%	70%+
Safety compliance	93.4%	92.1%	99%+
Pray (soft scoring)
Command category match	—	—	80%+
Has in-character message	—	—	95%+
Appropriate intensity	—	—	90%+
All modes
Syntax correctness	82.9%	82.9%	95%+
Hallucination rate	0%	0%	0%
Empty response rate	9.2%	14.5%	<3%
Response latency (avg)	6.4s	13.5s	<5s

This document is updated as the project evolves. Check git history for previous versions.

24 KiB Raw Blame History