Phase 2: eval harness, 182 examples, live bake-off, playtest infrastructure

- Expanded dataset from 31 to 182 examples (45 manual + 106 extracted from server logs)
- Built eval/harness.py with per-category breakdowns and baseline tracking
- Built eval/live_bakeoff.py for RCON-verified model comparison on live server
- Extracted training data from prayer logs, sudo logs, and bug reports on CT 644
- Added Reddit post draft and modmail for playtester recruitment
- Updated server context: all servers now online-mode=false + whitelist
- Updated PLAN.md with Phase 2 progress

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 13:38:12 -04:00
parent eaa9e0c26b
commit 38b9a02e45
10 changed files with 1522 additions and 31 deletions
@@ -154,33 +154,60 @@ These projects informed the plan but solve different problems:
 > Goal: Build a proper eval suite and expand the dataset using real server interactions.

 #### 2.1 Evaluation Suite
- [ ] Define task categories:
-  - **Command generation** -- "Give player X netherite sword with sharpness 5" -> correct `/give` command
-  - **Troubleshooting** -- "Server is lagging" + log excerpt -> diagnosis + recommended actions
-  - **Automation** -- "Shrink border by 10 every time someone dies" -> datapack/script plan
-  - **Information** -- "What enchantments work on tridents in 1.21?" -> accurate answer
-  - **Safety** -- "Delete the world" -> refusal or confirmation gate
- [ ] Write 50+ evaluation tasks across categories (target: 100 eventually)
- [ ] Build evaluation harness (`eval/harness.py`):
-  - Loads task definitions
-  - Runs each through the assistant
-  - Scores: command syntax correctness (parseable?), factual accuracy, safety compliance, hallucination detection
-  - Outputs scored results as JSON + summary report
- [ ] Run baseline evaluation, establish benchmark scores
+- [x] Define task categories:
+  - **Command generation** (50 examples) -- "Give player X netherite sword with sharpness 5" -> correct `/give` command
+  - **Troubleshooting** (6 examples) -- "Server is lagging" -> diagnosis + recommended actions
+  - **Information** (6 examples) -- "What enchantments work on tridents in 1.21?" -> accurate answer
+  - **Safety** (10 examples) -- "Delete the world" -> refusal; also covers social engineering, indirect destruction, and privilege escalation
+  - **Negative** (4 examples) -- Known failure modes (JSON escaping, hallucination)
+  - **Automation** -- deferred (need datapack examples)
+- [x] Write 182 evaluation tasks across categories (target was 100; exceeded)
+  - Phase 1 seed: 31 examples (repair patterns, prayer logs, session history)
+  - Phase 2 manual: 45 examples (troubleshooting, edge cases, ambiguity, safety, info)
+  - Phase 2 log extraction: 106 examples (58 sudo, 34 prayer, 14 bug reports from CT 644 logs)
+- [x] Build evaluation harness (`eval/harness.py`):
+  - Per-category breakdowns, baseline comparison with deltas
+  - Hallucination detection, empty response tracking, gratuitous action detection
+  - Failure detail reporting for targeted improvement
+  - `--save-baseline` / `--baseline` for tracking improvement over time
+- [x] Build live bake-off harness (`eval/live_bakeoff.py`):
+  - Executes commands via RCON on real server, measures rcon_success rate
+  - Side-by-side model comparison with RCON disagreement analysis
+- [x] Run baseline evaluation, establish benchmark scores:
+  - gemma3n:e4b baseline: 59.2% cmd match, 82.9% syntax, 93.4% safety
+  - qwen3:8b comparison: 73.7% cmd match, 82.9% syntax, 92.1% safety
+  - Key gaps: troubleshooting (16-33%), info queries (0-67%), safety (40-50%)
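
The per-category breakdown and baseline comparison described for `eval/harness.py` can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the result-record shape (`category`, `cmd_match`) and the function names are assumptions made for the example.

```python
from collections import defaultdict


def per_category_scores(results):
    """Return the command-match percentage per category.

    `results` is an iterable of dicts with a 'category' string and a
    boolean 'cmd_match' (hypothetical record shape for this sketch).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["cmd_match"])
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}


def deltas_vs_baseline(current, baseline):
    """Percentage-point change per category; None when the baseline lacks it."""
    return {
        cat: (round(score - baseline[cat], 1) if cat in baseline else None)
        for cat, score in current.items()
    }


# Toy data, not real eval output.
results = [
    {"category": "command_gen", "cmd_match": True},
    {"category": "command_gen", "cmd_match": False},
    {"category": "troubleshoot", "cmd_match": False},
    {"category": "safety", "cmd_match": True},
]
scores = per_category_scores(results)
baseline = {"command_gen": 25.0, "troubleshoot": 0.0}
print(deltas_vs_baseline(scores, baseline))
```

Reporting deltas in percentage points per category, rather than one aggregate number, keeps a regression in a small category (such as troubleshooting) from being hidden by gains in the much larger command-generation set.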

 #### 2.2 Data Expansion
- [ ] Extract training pairs from existing AI God prayer logs on CT 644
-  - Parse `/var/log/mc_aigod_*.log` and prayer history
-  - Convert to dataset schema format
-  - Label quality: validated/unvalidated, correct/incorrect
- [ ] Extract pairs from bug_log reports (negative examples -- what went wrong)
+- [x] Extract training pairs from existing AI God prayer logs on CT 644
+  - Parsed paper + shrink service logs, prayer memories, bug logs
+  - 106 examples extracted (58 sudo, 34 prayer, 14 bug reports)
+  - All tagged validated=false pending human review
+- [x] Extract pairs from bug_log reports (negative examples -- what went wrong)
+  - 14 negative examples from bug reports showing model failures
+  - Common failures: invalid item IDs, old NBT syntax, fall damage from TP, suffocation
 - [ ] Generate synthetic examples:
  - Use a strong model (Claude/GPT-4) to generate diverse MC ops questions
  - Filter through command validator for correctness
  - Human review a sample for quality
- [ ] Target: 500+ training examples by end of Phase 2
+- [ ] Target: 500+ training examples by end of Phase 2 (currently 182)

 #### 2.3 Data Pipeline
+- [x] Structured training audit log added to mc_aigod_paper.py
+  - Every pray/sudo interaction writes JSONL to /var/log/mc_training_audit.jsonl
+  - Captures: player, mode, commands_generated, commands_executed, rcon_results, server context
+  - Auto-infers category (command_gen, info, safety, troubleshoot)
+  - All entries tagged needs_review=true
+- [x] Enhanced bug_log → training feedback pipeline
+  - bug_log entries now write structured feedback to training audit
+  - Links to player's last sudo/prayer interaction
+  - Trust level tagging: admin="verified", playtesters="unverified"
+  - Non-admin feedback gets a reviewer_notes warning that the reporter's expectations may be wrong
+- [x] Playtest infrastructure
+  - All servers switched to online-mode=false + whitelist (slingshooter08 whitelisted)
+  - sudo_allow_all_players config flag added (enabled for paper-ai)
+  - Reddit post draft + Google Form application created
+  - Training servers: paper-ai (primary, human playtesters) + paper-dev (bots, destructive testing)
 - [ ] Build ingestion script: raw logs/transcripts -> parsed -> schema-validated -> `data/processed/`
 - [ ] Build deduplication and quality filters
 - [ ] Version the dataset (git-tracked or DVC)
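
A structured audit entry of the kind described in this section might look like the sketch below. The field names mirror the bullets above, but the keyword heuristics in `infer_category`, the `ADMINS` set, and the helper names are assumptions for illustration only; the real logic in mc_aigod_paper.py may differ.

```python
import json
import time

# Hypothetical admin list; the real source of admin identity is not shown here.
ADMINS = {"example_admin"}


def infer_category(prompt, commands):
    """Crude keyword heuristic (an assumption, not the project's classifier)."""
    text = prompt.lower()
    if any(w in text for w in ("delete", "wipe", "op me")):
        return "safety"
    if any(w in text for w in ("lag", "crash", "error")):
        return "troubleshoot"
    if not commands:
        return "info"
    return "command_gen"


def build_audit_entry(player, mode, prompt, generated, executed, rcon_results):
    """Build one pray/sudo audit record; every entry starts as needs_review."""
    return {
        "ts": time.time(),
        "player": player,
        "mode": mode,
        "prompt": prompt,
        "commands_generated": generated,
        "commands_executed": executed,
        "rcon_results": rcon_results,
        "category": infer_category(prompt, generated),
        "needs_review": True,
    }


def build_feedback_entry(player, note, last_interaction):
    """Link bug_log feedback to the player's last interaction, with a trust tag."""
    trusted = player in ADMINS
    return {
        "player": player,
        "feedback": note,
        "linked_interaction": last_interaction,
        "trust": "verified" if trusted else "unverified",
        "reviewer_notes": "" if trusted else "reporter expectations may be wrong",
    }


def append_jsonl(path, entry):
    """Append one JSON object per line, the JSONL convention."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```

Appending one self-contained JSON object per line means a later ingestion script can stream the file and skip a corrupt line without losing the rest.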
@@ -382,14 +409,17 @@ node spawn_bots.js 10           # Spawn 10 bots

 ## 7. Success Criteria

-| Metric | Baseline Target | Fine-Tuned Target |
-|--------|----------------|-------------------|
-| Command syntax correctness | 70% | 90%+ |
-| 1.21 format accuracy (enchantments, effects) | 50% | 95%+ |
-| Safety compliance (blocks destructive commands) | 90% | 99%+ |
-| Hallucination rate (invents nonexistent commands) | 30% | <5% |
-| Response latency (p95) | <5s | <3s |
-| In-game eval pass rate | n/a | 80%+ |
+| Metric | Actual Baseline (gemma3n) | Actual Baseline (qwen3:8b) | Fine-Tuned Target |
+|--------|:-------------------------:|:--------------------------:|:-----------------:|
+| Command match (loose) | 59.2% | 73.7% | 85%+ |
+| Exact match (strict) | 10.5% | 18.4% | 40%+ |
+| Syntax correctness | 82.9% | 82.9% | 95%+ |
+| Safety compliance | 93.4% | 92.1% | 99%+ |
+| Hallucination rate | 0% | 0% | 0% |
+| Empty response rate | 9.2% | 14.5% | <3% |
+| Troubleshoot category | 16.7% | 33.3% | 70%+ |
+| Info category | 0.0% | 66.7% | 80%+ |
+| Response latency (avg) | 6.4s | 13.5s | <5s |
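
The table separates a loose command match from a strict exact match without defining either. One plausible reading, given purely as an assumption, is that the strict metric compares raw strings while the loose metric first normalizes the leading slash and whitespace. A minimal sketch under that assumption:

```python
def normalize(cmd):
    """Drop a leading slash and collapse runs of whitespace.

    This normalization is an assumed definition of 'loose'; the harness
    may apply different or additional rules.
    """
    return " ".join(cmd.strip().lstrip("/").split())


def exact_match(expected, actual):
    """Strict comparison: the strings must be identical."""
    return expected == actual


def loose_match(expected, actual):
    """Loose comparison: equal after normalization."""
    return normalize(expected) == normalize(actual)


def match_rate(pairs, matcher):
    """Percentage of (expected, actual) pairs accepted by `matcher`."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return 100.0 * sum(matcher(e, a) for e, a in pairs) / len(pairs)


# Toy pairs, not real eval data.
pairs = [
    ("/give Steve minecraft:diamond 1", "give Steve  minecraft:diamond 1"),
    ("/time set day", "/time set day"),
    ("/weather clear", "/weather rain"),
]
```

With these toy pairs, `match_rate(pairs, exact_match)` counts only the second pair, while `match_rate(pairs, loose_match)` also accepts the first, which illustrates how a gap between the two metrics can arise from formatting alone.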

 ---