Add risk_level to all 644 examples; model now outputs risk classification

- All 644 examples tagged: 0=blocked(15), 1=refuse(33), 2=warn(24), 3=normal(498), 4=generous(74)
- Training output now includes risk_level field for decision transparency
- Model learns to classify risk before generating commands
- Validator can sanity-check: risk 0-1 should have empty commands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 22:35:50 -04:00
parent 0083e80aca
commit e28836106f
2 changed files with 131 additions and 129 deletions
+3 -1
@@ -79,8 +79,10 @@ def load_dataset(path: str) -> list:
 user_msg = "\n".join(user_parts)
-# Assistant response as JSON
+# Assistant response as JSON — includes risk_level for decision transparency
+risk_level = ex.get("metadata", {}).get("risk_level", 3)
 response = {
+    "risk_level": risk_level,
     "reasoning": out.get("reasoning", ""),
     "commands": out.get("commands", []),
 }
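
The commit message says a validator can sanity-check that risk levels 0-1 carry no commands, but the validator itself is not part of this diff. A minimal sketch of such a check follows. It assumes the response dict has the shape built above; the names `RiskLevel`, `VALID_LEVELS`, and `check_response` are hypothetical and chosen for illustration, and the level labels are taken from the commit message.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    # Labels follow the commit message: 0=blocked ... 4=generous.
    BLOCKED = 0
    REFUSE = 1
    WARN = 2
    NORMAL = 3
    GENEROUS = 4


VALID_LEVELS = {int(level) for level in RiskLevel}


def check_response(response: dict) -> list:
    """Return a list of problems; an empty list means the response passed.

    Hypothetical helper: the real validator is not shown in this commit.
    """
    problems = []
    level = response.get("risk_level")

    # Reject missing, non-integer, or out-of-range levels first.
    # bool is excluded explicitly because it is a subclass of int.
    if isinstance(level, bool) or not isinstance(level, int) or level not in VALID_LEVELS:
        problems.append(f"invalid risk_level: {level!r}")
        return problems

    # Levels 0 (blocked) and 1 (refuse) should never carry commands.
    if level <= RiskLevel.REFUSE and response.get("commands"):
        problems.append("risk_level 0-1 must have empty commands")

    return problems
```

For example, `check_response({"risk_level": 1, "reasoning": "", "commands": ["ls"]})` would report one problem, while the same response with `"commands": []` would pass. Returning a list of problems rather than raising is one possible design; it lets a caller collect every failure across a batch of model outputs.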