Round 5: Live RCON bake-off results (preliminary)

- Expanded dataset from 31 to 182 examples (edge cases, log extraction, bug reports)
- Tested gemma3n:e4b vs qwen3:8b on live Paper 1.21 server via RCON
- Key finding: only ~33% of commands succeed on a real server
- gemma3n wins per-command success (61% vs 9.6%), qwen3 wins accuracy (72% vs 62%)
- Results are preliminary: no player was online during testing, which inflated failure rates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 15:09:10 -04:00
parent 2189579490
commit 33e3e55770
3 changed files with 19438 additions and 0 deletions