Add risk_level to all 644 examples; model now outputs risk classification

- All 644 examples tagged: 0=blocked (15), 1=refuse (33), 2=warn (24), 3=normal (498), 4=generous (74)
- Training output now includes risk_level field for decision transparency
- Model learns to classify risk before generating commands
- Validator can sanity-check: risk 0-1 should have empty commands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 22:35:50 -04:00
parent 0083e80aca
commit e28836106f
2 changed files with 131 additions and 129 deletions
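
Note on the validator check mentioned in the commit message: the rule "risk 0-1 should have empty commands" could be sketched roughly as below. This is a minimal illustration, not the validator in this repository. The names `RiskLevel` and `check_response` are hypothetical, and the sketch assumes the model response is a JSON object with the `risk_level`, `reasoning`, and `commands` keys shown in the diff.

```python
import json
from enum import IntEnum


class RiskLevel(IntEnum):
    """Hypothetical names for the five levels listed in the commit message."""
    BLOCKED = 0
    REFUSE = 1
    WARN = 2
    NORMAL = 3
    GENEROUS = 4


def check_response(raw: str) -> list[str]:
    """Return a list of problems found in one JSON model response."""
    response = json.loads(raw)
    level = response.get("risk_level")

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(level, bool) or not isinstance(level, int) or not 0 <= level <= 4:
        return [f"invalid risk_level: {level!r}"]

    problems = []
    commands = response.get("commands", [])
    # Levels 0 (blocked) and 1 (refuse) should not carry any commands.
    if level <= RiskLevel.REFUSE and commands:
        problems.append(
            f"risk_level {level} should have empty commands, got {len(commands)}"
        )
    return problems
```

A response such as `{"risk_level": 1, "reasoning": "", "commands": ["ls"]}` would be flagged, while the same commands at level 3 would pass. Returning a list of problems rather than raising keeps the check usable as a batch report over all training outputs.
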
@@ -79,8 +79,10 @@ def load_dataset(path: str) -> list:

            user_msg = "\n".join(user_parts)

-            # Assistant response as JSON
+            # Assistant response as JSON — includes risk_level for decision transparency
+            risk_level = ex.get("metadata", {}).get("risk_level", 3)
            response = {
+                "risk_level": risk_level,
                "reasoning": out.get("reasoning", ""),
                "commands": out.get("commands", []),
            }