Phase 2: eval harness, 182 examples, live bake-off, playtest infrastructure
- Expanded dataset from 31 to 182 examples (45 manual + 106 extracted from server logs)
- Built eval/harness.py with per-category breakdowns and baseline tracking
- Built eval/live_bakeoff.py for RCON-verified model comparison on live server
- Extracted training data from prayer logs, sudo logs, and bug reports on CT 644
- Added Reddit post draft and modmail for playtester recruitment
- Updated server context: all servers now online-mode=false + whitelist
- Updated PLAN.md with Phase 2 progress

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -154,33 +154,60 @@ These projects informed the plan but solve different problems:
 > Goal: Build a proper eval suite and expand the dataset using real server interactions.

 #### 2.1 Evaluation Suite
-- [ ] Define task categories:
-  - **Command generation** -- "Give player X netherite sword with sharpness 5" -> correct `/give` command
-  - **Troubleshooting** -- "Server is lagging" + log excerpt -> diagnosis + recommended actions
-  - **Automation** -- "Shrink border by 10 every time someone dies" -> datapack/script plan
-  - **Information** -- "What enchantments work on tridents in 1.21?" -> accurate answer
-  - **Safety** -- "Delete the world" -> refusal or confirmation gate
-- [ ] Write 50+ evaluation tasks across categories (target: 100 eventually)
-- [ ] Build evaluation harness (`eval/harness.py`):
-  - Loads task definitions
-  - Runs each through the assistant
-  - Scores: command syntax correctness (parseable?), factual accuracy, safety compliance, hallucination detection
-  - Outputs scored results as JSON + summary report
-- [ ] Run baseline evaluation, establish benchmark scores
+- [x] Define task categories:
+  - **Command generation** (50 examples) -- "Give player X netherite sword with sharpness 5" -> correct `/give` command
+  - **Troubleshooting** (6 examples) -- "Server is lagging" -> diagnosis + recommended actions
+  - **Information** (6 examples) -- "What enchantments work on tridents in 1.21?" -> accurate answer
+  - **Safety** (10 examples) -- "Delete the world" -> refusal, social engineering, indirect destruction, privilege escalation
+  - **Negative** (4 examples) -- Known failure modes (JSON escaping, hallucination)
+  - **Automation** -- deferred (need datapack examples)
+- [x] Write 182 evaluation tasks across categories (target was 100; exceeded)
+  - Phase 1 seed: 31 examples (repair patterns, prayer logs, session history)
+  - Phase 2 manual: 45 examples (troubleshooting, edge cases, ambiguity, safety, info)
+  - Phase 2 log extraction: 106 examples (58 sudo, 34 prayer, 14 bug reports from CT 644 logs)
+- [x] Build evaluation harness (`eval/harness.py`):
+  - Per-category breakdowns, baseline comparison with deltas
+  - Hallucination detection, empty response tracking, gratuitous action detection
+  - Failure detail reporting for targeted improvement
+  - `--save-baseline` / `--baseline` for tracking improvement over time
+- [x] Build live bake-off harness (`eval/live_bakeoff.py`):
+  - Executes commands via RCON on real server, measures rcon_success rate
+  - Side-by-side model comparison with RCON disagreement analysis
+- [x] Run baseline evaluation, establish benchmark scores:
+  - gemma3n:e4b baseline: 59.2% cmd match, 82.9% syntax, 93.4% safety
+  - qwen3:8b comparison: 73.7% cmd match, 82.9% syntax, 92.1% safety
+  - Key gaps: troubleshooting (16-33%), info queries (0-67%), safety (40-50%)
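
The per-category breakdown and baseline deltas listed for `eval/harness.py` could be computed along these lines. This is a minimal sketch, not the harness itself: the result-record fields (`category`, `passed`) and both function names are assumptions made for illustration.

```python
from collections import defaultdict


def category_pass_rates(results):
    """Return {category: pass rate in percent} for a list of result dicts.

    Each result is assumed to look like {"category": str, "passed": bool}.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        totals[result["category"]] += 1
        if result["passed"]:
            passes[result["category"]] += 1
    return {cat: 100.0 * passes[cat] / totals[cat] for cat in totals}


def deltas_vs_baseline(current, baseline):
    """Return {category: current - baseline} in percentage points.

    Categories that are missing from the saved baseline map to None.
    """
    return {
        cat: (rate - baseline[cat]) if cat in baseline else None
        for cat, rate in current.items()
    }
```

Given a saved baseline of `{"command_gen": 40.0}`, a current run at 50% in that category would report a delta of 10 points, while a category absent from the baseline reports `None` rather than a misleading zero.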

 #### 2.2 Data Expansion
-- [ ] Extract training pairs from existing AI God prayer logs on CT 644
-  - Parse `/var/log/mc_aigod_*.log` and prayer history
-  - Convert to dataset schema format
-  - Label quality: validated/unvalidated, correct/incorrect
-- [ ] Extract pairs from bug_log reports (negative examples -- what went wrong)
+- [x] Extract training pairs from existing AI God prayer logs on CT 644
+  - Parsed paper + shrink service logs, prayer memories, bug logs
+  - 106 examples extracted (58 sudo, 34 prayer, 14 bug reports)
+  - All tagged validated=false, needs human review
+- [x] Extract pairs from bug_log reports (negative examples -- what went wrong)
+  - 14 negative examples from bug reports showing model failures
+  - Common failures: invalid item IDs, old NBT syntax, fall damage from TP, suffocation
 - [ ] Generate synthetic examples:
   - Use a strong model (Claude/GPT-4) to generate diverse MC ops questions
   - Filter through command validator for correctness
   - Human review a sample for quality
-- [ ] Target: 500+ training examples by end of Phase 2
+- [ ] Target: 500+ training examples by end of Phase 2 (currently 182)
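
To make the "tagged validated=false" step concrete, here is one way an extracted interaction might be wrapped as an unreviewed record and tallied by source. The dataset schema is not shown in this commit, so every field name below (`source`, `prompt`, `commands`, `negative`, `validated`) and both helper names are hypothetical.

```python
from collections import Counter

# Hypothetical source label; the plan only says bug reports act as negative examples.
NEGATIVE_SOURCES = {"bug_report"}


def to_example(source, prompt, commands):
    """Wrap one extracted interaction as an unreviewed dataset record."""
    return {
        "source": source,
        "prompt": prompt,
        "commands": list(commands),
        # Bug reports document what went wrong, so they are flagged as negatives.
        "negative": source in NEGATIVE_SOURCES,
        # Every extracted example starts unvalidated and awaits human review.
        "validated": False,
    }


def count_by_source(examples):
    """Tally records per source label."""
    return Counter(example["source"] for example in examples)
```

Run over the extraction described in this commit, such a tally would be expected to show 58 sudo, 34 prayer, and 14 bug-report records, 106 in total.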

 #### 2.3 Data Pipeline
+- [x] Structured training audit log added to mc_aigod_paper.py
+  - Every pray/sudo interaction writes JSONL to /var/log/mc_training_audit.jsonl
+  - Captures: player, mode, commands_generated, commands_executed, rcon_results, server context
+  - Auto-infers category (command_gen, info, safety, troubleshoot)
+  - All entries tagged needs_review=true
+- [x] Enhanced bug_log → training feedback pipeline
+  - bug_log entries now write structured feedback to training audit
+  - Links to player's last sudo/prayer interaction
+  - Trust level tagging: admin="verified", playtesters="unverified"
+  - Non-admin feedback gets reviewer_notes warning about possible wrong expectations
+- [x] Playtest infrastructure
+  - All servers switched to online-mode=false + whitelist (slingshooter08 whitelisted)
+  - sudo_allow_all_players config flag added (enabled for paper-ai)
+  - Reddit post draft + Google Form application created
+  - Training servers: paper-ai (primary, human playtesters) + paper-dev (bots, destructive testing)
 - [ ] Build ingestion script: raw logs/transcripts -> parsed -> schema-validated -> `data/processed/`
 - [ ] Build deduplication and quality filters
 - [ ] Version the dataset (git-tracked or DVC)
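
The audit log and feedback tagging described in 2.3 might look roughly like the sketch below. It is not the code in `mc_aigod_paper.py`: the function names, the `timestamp`, `type`, `report`, and `trust` keys, and the wording of the reviewer note are assumptions. Only the captured fields, the `needs_review` flag, and the verified/unverified split come from the plan.

```python
import json
import time


def build_audit_entry(player, mode, commands_generated, commands_executed,
                      rcon_results, server_context, category):
    """Assemble one pray/sudo interaction as a JSON-serializable dict."""
    return {
        "timestamp": time.time(),  # assumption: not listed in the plan
        "player": player,
        "mode": mode,  # e.g. "pray" or "sudo"
        "commands_generated": commands_generated,
        "commands_executed": commands_executed,
        "rcon_results": rcon_results,
        "server_context": server_context,
        "category": category,  # command_gen, info, safety, or troubleshoot
        "needs_review": True,  # every entry awaits human review
    }


def build_feedback_entry(player, report, is_admin):
    """Assemble a bug_log feedback record with a trust level."""
    entry = {
        "type": "bug_feedback",  # hypothetical discriminator field
        "player": player,
        "report": report,
        "trust": "verified" if is_admin else "unverified",
        "needs_review": True,
    }
    if not is_admin:
        # Playtester expectations may themselves be wrong, so warn the reviewer.
        entry["reviewer_notes"] = "Non-admin report: expectation may be incorrect."
    return entry


def append_jsonl(path, entry):
    """Append one entry to a JSONL file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
```

Appending one object per line keeps the file valid JSONL even if the service restarts between writes, and lets a later ingestion step stream it line by line.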
@@ -382,14 +409,17 @@ node spawn_bots.js 10 # Spawn 10 bots

 ## 7. Success Criteria

-| Metric | Baseline Target | Fine-Tuned Target |
-|--------|----------------|-------------------|
-| Command syntax correctness | 70% | 90%+ |
-| 1.21 format accuracy (enchantments, effects) | 50% | 95%+ |
-| Safety compliance (blocks destructive commands) | 90% | 99%+ |
-| Hallucination rate (invents nonexistent commands) | 30% | <5% |
-| Response latency (p95) | <5s | <3s |
-| In-game eval pass rate | n/a | 80%+ |
+| Metric | Actual Baseline (gemma3n) | Actual Baseline (qwen3:8b) | Fine-Tuned Target |
+|--------|:-------------------------:|:--------------------------:|:-----------------:|
+| Command match (loose) | 59.2% | 73.7% | 85%+ |
+| Exact match (strict) | 10.5% | 18.4% | 40%+ |
+| Syntax correctness | 82.9% | 82.9% | 95%+ |
+| Safety compliance | 93.4% | 92.1% | 99%+ |
+| Hallucination rate | 0% | 0% | 0% |
+| Empty response rate | 9.2% | 14.5% | <3% |
+| Troubleshoot category | 16.7% | 33.3% | 70%+ |
+| Info category | 0.0% | 66.7% | 80%+ |
+| Response latency (avg) | 6.4s | 13.5s | <5s |

 ---
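
The updated success-criteria table can be read as a checklist of gaps to close. The sketch below copies the baseline and target values from that table and reports which targets each model still misses. The encoding of targets as `("min" | "max" | "below", threshold)` pairs and the helper names are assumptions, not part of the repository.

```python
# Target kinds: "min" means value >= threshold, "max" means value <= threshold,
# "below" means value strictly < threshold. Units are percent, or seconds for latency.
TARGETS = {
    "Command match (loose)": ("min", 85.0),
    "Exact match (strict)": ("min", 40.0),
    "Syntax correctness": ("min", 95.0),
    "Safety compliance": ("min", 99.0),
    "Hallucination rate": ("max", 0.0),
    "Empty response rate": ("below", 3.0),
    "Troubleshoot category": ("min", 70.0),
    "Info category": ("min", 80.0),
    "Response latency (avg)": ("below", 5.0),
}

# Baseline values in the same order as TARGETS, copied from the table.
BASELINES = {
    "gemma3n": dict(zip(TARGETS, [59.2, 10.5, 82.9, 93.4, 0.0, 9.2, 16.7, 0.0, 6.4])),
    "qwen3:8b": dict(zip(TARGETS, [73.7, 18.4, 82.9, 92.1, 0.0, 14.5, 33.3, 66.7, 13.5])),
}


def meets_target(value, kind, threshold):
    """Return True if value satisfies a target of the given kind."""
    if kind == "min":
        return value >= threshold
    if kind == "max":
        return value <= threshold
    if kind == "below":
        return value < threshold
    raise ValueError(f"unknown target kind: {kind}")


def unmet_targets(model):
    """List the metrics for which a model's baseline misses the fine-tuned target."""
    return [
        metric
        for metric, (kind, threshold) in TARGETS.items()
        if not meets_target(BASELINES[model][metric], kind, threshold)
    ]
```

On the numbers in the table, both baselines already satisfy the hallucination target and miss the other eight, which matches the plan's framing of troubleshooting, info queries, and empty responses as the main gaps.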