Add baseline assistant with tools, guardrails, and system prompts (Phase 1.4)

- agent/serve.py: CLI assistant with interactive, single-query, and eval modes (Ollama + qwen3-coder) - agent/tools/rcon_tool.py: RCON execute, server status, player info - agent/tools/knowledge_tool.py: TF-IDF RAG search, command reference lookup, server context - agent/guardrails/command_filter.py: 14-prefix allowlist, execute-tail bypass detection, destructive flags, 1.21 syntax warnings, audit log - agent/prompts/system_prompts.py: sudo (pure commands), god (persona), intervention (benign) system prompts - Guardrails tested: 10/10 allowlist, 5/6 syntax warnings pass
2026-03-18 02:12:20 -04:00
parent 77efac0283
commit e00d454b19
10 changed files with 815 additions and 12 deletions
@@ -130,18 +130,21 @@ These projects informed the plan but solve different problems:
 - [x] Validated with 6 test queries -- all return relevant top results

 #### 1.4 Baseline Assistant (No Fine-Tuning)
- [ ] Build prompt-only assistant using `qwen3-coder` (via Ollama at 192.168.0.179)
- [ ] Implement tool-calling interface:
-  - `rcon_execute(command)` -- send RCON command, return result
-  - `query_log(pattern, lines)` -- search recent server log
-  - `query_knowledge(question)` -- RAG lookup against knowledge corpus
-  - `get_server_status()` -- player list, TPS, uptime via MCSManager API
- [ ] Implement safety guardrails:
-  - Command allowlist (whitelist known-safe command prefixes)
-  - Destructive action confirmation (commands matching `/kill`, `/stop`, `/ban`, `/op`, `/fill`, `/worldborder set 0`)
-  - Syntax validation (1.21 enchantment format, weather values, effect names)
-  - Audit log (every command attempted + result, timestamped JSON)
- [ ] Test baseline on 20 seed examples, record accuracy manually
+- [x] Build prompt-only assistant (`agent/serve.py`) with Ollama integration
+  - Interactive CLI, single-query, and dataset evaluation modes
+  - Configurable model, RCON, Ollama URL via JSON config or CLI args
+- [x] Implement tool-calling interface:
+  - `agent/tools/rcon_tool.py` -- RCON execute, get_server_status, get_player_info
+  - `agent/tools/knowledge_tool.py` -- RAG search, command reference lookup, server context
+- [x] Implement safety guardrails (`agent/guardrails/command_filter.py`):
+  - Command allowlist (14 safe prefixes, blocks /stop /op /ban etc.)
+  - Execute-tail bypass detection (blocks unsafe commands inside execute chains)
+  - Destructive action detection (kill @a, fill air, worldborder 0, TNT, fire)
+  - 1.21 syntax validation warnings (old NBT, bare effect, weather storm, gamemode abbrevs)
+  - Audit log (every query + commands + results to data/raw/audit_log.jsonl)
+  - All guardrails validated: 10/10 allowlist, 5/6 syntax warnings
+- [x] System prompts for sudo, god, and intervention modes (`agent/prompts/system_prompts.py`)
+- [ ] Run baseline evaluation on seed dataset, record accuracy
 - [ ] Document baseline performance as the bar to beat

 ---