Mortdecai

Author	SHA1	Message	Date
Seth	029bd28a58	Gemini-powered prayer bots, POS cost printer, first LoRA training run Prayer bots (ingame/prayer_bots.js): - 3 Mineflayer bots that actively pray, sudo, and bug_log on dev server - Gemini 2.5 Flash Lite generates diverse natural prompts on the fly - Falls back to static pool if Gemini unavailable - 15-45s interval per bot, 50/35/10/5 pray/sudo/bug/chat split POS status printer (scripts/training_status_printer.py): - Prints training data collection status to Epson TM-m30 - Tracks: dataset size, audit logs, bot activity, Gemini API cost, service status - Triggers on $0.50 cost threshold (configurable), checks every 15 min - --dry-run, --check, --force flags Training: - First LoRA run completed (233 examples, 3 epochs, loss 1.5→0.10) - GGUF exported and loaded into Ollama as qwen3-8b-mc-lora on steel141 - Model is bad (expected) — hallucinating Chinese, leaking system prompt - Deployed to dev server for live testing and data collection - bf16 fix for Ampere GPU, system prompts included in training conversations Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 17:36:08 -04:00
Seth	142e4fd3c4	Fix training script: bf16 for Ampere GPU, add system prompts to training data - Switch fp16 to bf16 (RTX 3090 Ti is Ampere, supports BF16 natively) - Include system prompt in training conversations (mode-aware: sudo/god/god_system) - Include message field only for god modes - Add determine_mode() and get_system_prompt() helpers Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 16:26:47 -04:00
Seth	78031d16c0	Risk gradient (0-5), updated system prompts, 233 examples Risk gradient system: - All 233 training examples tagged with risk_level (0-5) - 0=blocked(15), 1=refuse(9), 2=warn(17), 3=normal(169), 4=generous(23) - Schema updated with risk_level and scoring_mode fields - Eval harness uses risk_level for safety scoring System prompts rewritten: - Shared syntax rules and risk gradient reference across all modes - Sudo: permission level 4, do what admin asks, only refuse level 0-1 - God: permission level 2-4 (mood-dependent), character-driven decisions - God_system: permission level 3, 80% benevolent / 15% mischievous / 5% wrathful Data: - 20 new live playtest examples from training audit log (233 total) - 43 wrong→right pairs (17 from validator repairs) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 16:14:54 -04:00
Seth	9d789d2524	Three-tier constraint model, mode-aware eval, boundary examples, playtest tooling Eval harness: - Mode-aware scoring: sudo=strict (exact match), pray/god=soft (category match, in-character, appropriate intensity) - New metrics: cmd_category_match, appropriate_intensity, scoring_mode breakdown - Eval defaults to steel141 (192.168.0.141) — prod GPU reserved for serving Dataset (213 examples): - Added 31 boundary/adversarial examples (safety edges, abstention, near-boundary) - Updated pray example reasoning: character-driven logic, not prescriptive outputs - Tagged pray examples with scoring_mode=soft Playtest tooling: - whitelist.sh: add/remove/list across all 3 servers - FRIENDS_INVITE.md + Discord version: playtester recruitment docs - Server addresses and implementation details for both training servers PLAN.md: - Three-tier constraint model documented (sudo/pray/god_system) - Success criteria split by scoring mode - All session decisions logged Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 15:57:01 -04:00
Seth	38b9a02e45	Phase 2: eval harness, 182 examples, live bake-off, playtest infrastructure - Expanded dataset from 31 to 182 examples (45 manual + 106 extracted from server logs) - Built eval/harness.py with per-category breakdowns and baseline tracking - Built eval/live_bakeoff.py for RCON-verified model comparison on live server - Extracted training data from prayer logs, sudo logs, and bug reports on CT 644 - Added Reddit post draft and modmail for playtester recruitment - Updated server context: all servers now online-mode=false + whitelist - Updated PLAN.md with Phase 2 progress Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 13:38:12 -04:00
Seth	eaa9e0c26b	Update PLAN.md with bake-off decisions and hardware assignments Key decisions: gemma3n:e4b for serving (RTX 4000), qwen3:8b for fine-tuning base (RTX 3090 Ti). Phase 1 complete. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 10:41:47 -04:00
Seth	48b627d498	Add LoRA training scripts and fix bake-off token budget - training/scripts/train_lora.py: Unsloth QLoRA trainer for qwen3:8b - training/scripts/train_lora.sh: Launch script for steel141 RTX 3090 Ti - eval/bakeoff.py: Fixed token budget (400->1500) that caused qwen3 models to exhaust tokens on thinking, added --no-think flag - agent/serve.py: Default model changed to gemma3n:e4b Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 10:40:18 -04:00
Seth	6fbab8045c	Add bake-off results summary (7 models, 31 examples) gemma3n:e4b wins for production serving (80.6% cmd match, 100% safety). qwen3:8b recommended as fine-tuning base. Full per-model analysis and scoring methodology documented. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 09:03:40 -04:00
Seth	7da28c8800	Add model bake-off harness and base model research Bake-off tested 7 models on 31 seed examples via GPU-accelerated Ollama on node-197 RTX 4000. gemma3n:e4b leads for serving (80.6% cmd match, 100% safety, 5.9s). qwen3:8b recommended as fine-tuning base (Apache 2.0, best syntax quality, strong ecosystem). Full research in MODEL_RESEARCH.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 08:54:11 -04:00
Seth	e00d454b19	Add baseline assistant with tools, guardrails, and system prompts (Phase 1.4) - agent/serve.py: CLI assistant with interactive, single-query, and eval modes (Ollama + qwen3-coder) - agent/tools/rcon_tool.py: RCON execute, server status, player info - agent/tools/knowledge_tool.py: TF-IDF RAG search, command reference lookup, server context - agent/guardrails/command_filter.py: 14-prefix allowlist, execute-tail bypass detection, destructive flags, 1.21 syntax warnings, audit log - agent/prompts/system_prompts.py: sudo (pure commands), god (persona), intervention (benign) system prompts - Guardrails tested: 10/10 allowlist, 5/6 syntax warnings pass	2026-03-18 02:12:20 -04:00
Seth	77efac0283	Add knowledge corpus: 14 command references, server context, and TF-IDF search index (Phase 1.3) - knowledge/mc-commands/commands.json: 14 MC commands with JE syntax, args, examples, common errors, 1.21 version notes - knowledge/server-context/servers.json: all 4 servers (mc1, shrink, paper-ai, paper-dev) with full config - knowledge/build_index.py: TF-IDF indexer + search function (19 docs, 725 terms) - All command syntax validated live on dev server via RCON (12/13 passed) - PLAN.md: mark Phase 1.3 complete	2026-03-18 02:01:12 -04:00
Seth	827850b8d7	Initial project scaffold: dataset schema, 31 seed training examples, Mineflayer bot framework, and 7-phase roadmap - IDEA.md: project scope (Minecraft ops AI assistant via qwen3-coder LoRA/SFT) - PLAN.md: complete roadmap with prior art analysis, architecture, phased plan, dev server docs - data/schema.json: training example JSON Schema with negative_output support - data/processed/seed_dataset.jsonl: 31 validated examples from repair code, prayer logs, session history - data/validate_dataset.py: schema validator with summary statistics - ingame/: Mineflayer bot framework (test_connect, spawn_bots, aware_bots with full event logging) - Directory structure for knowledge/, eval/, training/, agent/ (Phase 1.3+ work)	2026-03-18 01:51:28 -04:00

12 Commits