Phase 2: eval harness, 182 examples, live bake-off, playtest infrastructure

- Expanded dataset from 31 to 182 examples (45 manual + 106 extracted from server logs)
- Built eval/harness.py with per-category breakdowns and baseline tracking
- Built eval/live_bakeoff.py for RCON-verified model comparison on live server
- Extracted training data from prayer logs, sudo logs, and bug reports on CT 644
- Added Reddit post draft and modmail for playtester recruitment
- Updated server context: all servers now online-mode=false + whitelist
- Updated PLAN.md with Phase 2 progress

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 13:38:12 -04:00
parent eaa9e0c26b
commit 38b9a02e45
10 changed files with 1522 additions and 31 deletions
@@ -154,33 +154,60 @@ These projects informed the plan but solve different problems:
> Goal: Build a proper eval suite and expand the dataset using real server interactions.
#### 2.1 Evaluation Suite
- [ ] Define task categories:
- **Command generation** -- "Give player X netherite sword with sharpness 5" -> correct `/give` command
- **Troubleshooting** -- "Server is lagging" + log excerpt -> diagnosis + recommended actions
- **Automation** -- "Shrink border by 10 every time someone dies" -> datapack/script plan
- **Information** -- "What enchantments work on tridents in 1.21?" -> accurate answer
- **Safety** -- "Delete the world" -> refusal or confirmation gate
- [ ] Write 50+ evaluation tasks across categories (target: 100 eventually)
- [ ] Build evaluation harness (`eval/harness.py`):
- Loads task definitions
- Runs each through the assistant
- Scores: command syntax correctness (parseable?), factual accuracy, safety compliance, hallucination detection
- Outputs scored results as JSON + summary report
- [ ] Run baseline evaluation, establish benchmark scores
- [x] Define task categories:
- **Command generation** (50 examples) -- "Give player X netherite sword with sharpness 5" -> correct `/give` command
- **Troubleshooting** (6 examples) -- "Server is lagging" -> diagnosis + recommended actions
- **Information** (6 examples) -- "What enchantments work on tridents in 1.21?" -> accurate answer
- **Safety** (10 examples) -- "Delete the world" -> refusal; also covers social engineering, indirect destruction, privilege escalation
- **Negative** (4 examples) -- Known failure modes (JSON escaping, hallucination)
- **Automation** -- deferred (need datapack examples)
- [x] Write 182 evaluation tasks across categories (target was 100; exceeded)
- Phase 1 seed: 31 examples (repair patterns, prayer logs, session history)
- Phase 2 manual: 45 examples (troubleshooting, edge cases, ambiguity, safety, info)
- Phase 2 log extraction: 106 examples (58 sudo, 34 prayer, 14 bug reports from CT 644 logs)
- [x] Build evaluation harness (`eval/harness.py`):
- Per-category breakdowns, baseline comparison with deltas
- Hallucination detection, empty response tracking, gratuitous action detection
- Failure detail reporting for targeted improvement
- `--save-baseline` / `--baseline` for tracking improvement over time
- [x] Build live bake-off harness (`eval/live_bakeoff.py`):
- Executes commands via RCON on real server, measures rcon_success rate
- Side-by-side model comparison with RCON disagreement analysis
- [x] Run baseline evaluation, establish benchmark scores:
- gemma3n:e4b baseline: 59.2% cmd match, 82.9% syntax, 93.4% safety
- qwen3:8b comparison: 73.7% cmd match, 82.9% syntax, 92.1% safety
- Key gaps: troubleshooting (16-33%), info queries (0-67%), safety (40-50%)
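A minimal sketch of how per-category pass rates and baseline deltas like those above could be computed. The function names and the result-record shape (`category`, `passed`) are assumptions for illustration, not the actual `eval/harness.py` interface. The pass counts are illustrative values chosen to match the 6-example categories and the reported rates.

```python
from collections import defaultdict


def category_breakdown(results):
    """Return the pass rate (percent, one decimal) for each category.

    `results` is a list of dicts with a 'category' string and a boolean 'passed'.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        passed[r["category"]] += int(r["passed"])
    return {cat: round(100.0 * passed[cat] / totals[cat], 1) for cat in totals}


def compare_to_baseline(current, baseline):
    """Return per-category deltas in percentage points (current minus baseline)."""
    return {cat: round(current[cat] - baseline.get(cat, 0.0), 1) for cat in current}


# Illustrative run: 2 of 6 troubleshooting tasks and 4 of 6 info tasks pass.
results = (
    [{"category": "troubleshooting", "passed": i < 2} for i in range(6)]
    + [{"category": "info", "passed": i < 4} for i in range(6)]
)
current = category_breakdown(results)

# Baseline rates taken from the gemma3n:e4b figures reported in this plan.
baseline = {"troubleshooting": 16.7, "info": 0.0}
deltas = compare_to_baseline(current, baseline)

for cat in current:
    print(f"{cat}: {current[cat]}% ({deltas[cat]:+.1f} pts vs baseline)")
```

Reporting deltas in percentage points rather than relative change keeps small categories readable, where a single extra pass moves the rate by about 17 points.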
#### 2.2 Data Expansion
- [ ] Extract training pairs from existing AI God prayer logs on CT 644
- Parse `/var/log/mc_aigod_*.log` and prayer history
- Convert to dataset schema format
- Label quality: validated/unvalidated, correct/incorrect
- [ ] Extract pairs from bug_log reports (negative examples -- what went wrong)
- [x] Extract training pairs from existing AI God prayer logs on CT 644
- Parsed paper + shrink service logs, prayer memories, bug logs
- 106 examples extracted (58 sudo, 34 prayer, 14 bug reports)
- All tagged validated=false; each needs human review
- [x] Extract pairs from bug_log reports (negative examples -- what went wrong)
- 14 negative examples from bug reports showing model failures
- Common failures: invalid item IDs, old NBT syntax, fall damage from TP, suffocation
- [ ] Generate synthetic examples:
- Use a strong model (Claude/GPT-4) to generate diverse MC ops questions
- Filter through command validator for correctness
- Human-review a sample for quality
- [ ] Target: 500+ training examples by end of Phase 2
- [ ] Target: 500+ training examples by end of Phase 2 (currently 182)
#### 2.3 Data Pipeline
- [x] Structured training audit log added to mc_aigod_paper.py
- Every pray/sudo interaction writes JSONL to /var/log/mc_training_audit.jsonl
- Captures: player, mode, commands_generated, commands_executed, rcon_results, server context
- Auto-infers category (command_gen, info, safety, troubleshoot)
- All entries tagged needs_review=true
- [x] Enhanced bug_log → training feedback pipeline
- bug_log entries now write structured feedback to training audit
- Links to player's last sudo/prayer interaction
- Trust level tagging: admin="verified", playtesters="unverified"
- Non-admin feedback gets reviewer_notes warning about possible wrong expectations
- [x] Playtest infrastructure
- All servers switched to online-mode=false + whitelist (slingshooter08 whitelisted)
- sudo_allow_all_players config flag added (enabled for paper-ai)
- Reddit post draft + Google Form application created
- Training servers: paper-ai (primary, human playtesters) + paper-dev (bots, destructive testing)
- [ ] Build ingestion script: raw logs/transcripts -> parsed -> schema-validated -> `data/processed/`
- [ ] Build deduplication and quality filters
- [ ] Version the dataset (git-tracked or DVC)
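The audit-log entries described above could be built and appended along these lines. This is a sketch under stated assumptions: the keyword rules in `CATEGORY_KEYWORDS`, the function names, and the `timestamp` and `prompt` fields are hypothetical, and the real logic in `mc_aigod_paper.py` is not shown in this plan. The example writes to a temporary directory rather than `/var/log/mc_training_audit.jsonl`.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Hypothetical keyword rules, checked in order; the first match wins.
CATEGORY_KEYWORDS = {
    "safety": ("delete", "wipe", "op me"),
    "troubleshoot": ("lag", "crash", "error"),
    "info": ("what", "how", "which"),
}


def infer_category(prompt):
    """Guess a category by crude substring matching; default to command_gen."""
    text = prompt.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in text for k in keywords):
            return category
    return "command_gen"


def build_audit_record(player, mode, prompt, commands_generated,
                       commands_executed, rcon_results):
    """Assemble one audit entry; every entry starts out needing review."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "player": player,
        "mode": mode,
        "prompt": prompt,
        "commands_generated": commands_generated,
        "commands_executed": commands_executed,
        "rcon_results": rcon_results,
        "category": infer_category(prompt),
        "needs_review": True,
    }


def append_jsonl(path, record):
    """Append a record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


# Illustrative entry; the RCON result string is a placeholder, not real output.
cmd = "give slingshooter08 minecraft:diamond_sword 1"
record = build_audit_record(
    "slingshooter08", "sudo", "give me a diamond sword", [cmd], [cmd], ["ok"]
)
audit_path = os.path.join(tempfile.mkdtemp(), "mc_training_audit.jsonl")
append_jsonl(audit_path, record)
```

One JSON object per line keeps the log append-only and lets a later ingestion script stream entries without loading the whole file.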
@@ -382,14 +409,17 @@ node spawn_bots.js 10 # Spawn 10 bots
## 7. Success Criteria
| Metric | Baseline Target | Fine-Tuned Target |
|--------|----------------|-------------------|
| Command syntax correctness | 70% | 90%+ |
| 1.21 format accuracy (enchantments, effects) | 50% | 95%+ |
| Safety compliance (blocks destructive commands) | 90% | 99%+ |
| Hallucination rate (invents nonexistent commands) | 30% | <5% |
| Response latency (p95) | <5s | <3s |
| In-game eval pass rate | n/a | 80%+ |
| Metric | Actual Baseline (gemma3n) | Actual Baseline (qwen3:8b) | Fine-Tuned Target |
|--------|:-------------------------:|:--------------------------:|:-----------------:|
| Command match (loose) | 59.2% | 73.7% | 85%+ |
| Exact match (strict) | 10.5% | 18.4% | 40%+ |
| Syntax correctness | 82.9% | 82.9% | 95%+ |
| Safety compliance | 93.4% | 92.1% | 99%+ |
| Hallucination rate | 0% | 0% | 0% |
| Empty response rate | 9.2% | 14.5% | <3% |
| Troubleshoot category | 16.7% | 33.3% | 70%+ |
| Info category | 0.0% | 66.7% | 80%+ |
| Response latency (avg) | 6.4s | 13.5s | <5s |
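As a rough aid for reading the table, the distance from each measured baseline to its fine-tuned target can be computed as below. The figures are copied from the table; the helper name `gap_to_target` and the treatment of "<3%" and "<5s" as plain thresholds are assumptions made for this sketch.

```python
# (metric, gemma3n, qwen3:8b, fine-tuned target, higher_is_better)
METRICS = [
    ("Command match (loose)", 59.2, 73.7, 85.0, True),
    ("Exact match (strict)", 10.5, 18.4, 40.0, True),
    ("Syntax correctness", 82.9, 82.9, 95.0, True),
    ("Safety compliance", 93.4, 92.1, 99.0, True),
    ("Empty response rate", 9.2, 14.5, 3.0, False),
    ("Response latency (avg, s)", 6.4, 13.5, 5.0, False),
]


def gap_to_target(value, target, higher_is_better):
    """Distance still to close, in the metric's own units; 0.0 once the target is met."""
    gap = target - value if higher_is_better else value - target
    return round(max(gap, 0.0), 1)


# Map each metric to its (gemma3n gap, qwen3:8b gap) pair.
gaps = {
    name: (gap_to_target(g, target, hib), gap_to_target(q, target, hib))
    for name, g, q, target, hib in METRICS
}

for name, (gemma_gap, qwen_gap) in gaps.items():
    print(f"{name}: gemma3n {gemma_gap}, qwen3:8b {qwen_gap}")
```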
---