Risk_level in all 644 examples + model outputs risk classification
- All 644 examples tagged: 0=blocked (15), 1=refuse (33), 2=warn (24), 3=normal (498), 4=generous (74)
- Training output now includes risk_level field for decision transparency
- Model learns to classify risk before generating commands
- Validator can sanity-check: risk 0-1 should have empty commands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
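
The validator sanity check named in the commit message could look roughly like the sketch below. The function name `check_risk_consistency` and the `RISK_LABELS` table are hypothetical and not part of this commit; only the field names `risk_level` and `commands` and the meaning of the five levels come from the message itself.

```python
import json

# Level names taken from the commit message; the table itself is illustrative.
RISK_LABELS = {0: "blocked", 1: "refuse", 2: "warn", 3: "normal", 4: "generous"}


def check_risk_consistency(raw_output: str) -> list[str]:
    """Return the problems found in one model output (empty list if none)."""
    try:
        response = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc}"]
    if not isinstance(response, dict):
        return ["output is not a JSON object"]

    risk = response.get("risk_level")
    # bool is a subclass of int, so reject it explicitly.
    if not isinstance(risk, int) or isinstance(risk, bool) or risk not in RISK_LABELS:
        return [f"risk_level must be an integer 0-4, got {risk!r}"]

    problems = []
    commands = response.get("commands", [])
    # Levels 0 (blocked) and 1 (refuse) should never carry commands.
    if risk <= 1 and commands:
        problems.append(
            f"risk_level {risk} ({RISK_LABELS[risk]}) should have no commands, "
            f"got {len(commands)}"
        )
    return problems


print(check_risk_consistency('{"risk_level": 1, "reasoning": "unsafe", "commands": ["rm -rf /"]}'))
print(check_risk_consistency('{"risk_level": 3, "reasoning": "ok", "commands": ["ls"]}'))
```

The first call reports one problem and the second reports none. Returning a list of messages rather than raising keeps the check usable for batch evaluation, where several outputs are scored at once.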
@@ -79,8 +79,10 @@ def load_dataset(path: str) -> list:
         user_msg = "\n".join(user_parts)

-        # Assistant response as JSON
+        # Assistant response as JSON — includes risk_level for decision transparency
+        risk_level = ex.get("metadata", {}).get("risk_level", 3)
         response = {
+            "risk_level": risk_level,
             "reasoning": out.get("reasoning", ""),
             "commands": out.get("commands", []),
         }
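
To make the fallback in this hunk concrete, here is a minimal standalone sketch of the response-building step. The helper name `build_response` and the `"output"` key are assumptions for illustration (the hunk does not show where `out` comes from); the `metadata.risk_level` lookup and the three response fields follow the diff.

```python
import json


def build_response(ex: dict) -> str:
    """Serialize one example's assistant turn, following the hunk's logic."""
    out = ex.get("output", {})  # assumed key; not visible in the hunk
    # Falls back to 3 (normal) when the example carries no risk tag.
    risk_level = ex.get("metadata", {}).get("risk_level", 3)
    response = {
        "risk_level": risk_level,
        "reasoning": out.get("reasoning", ""),
        "commands": out.get("commands", []),
    }
    return json.dumps(response)


tagged = {
    "metadata": {"risk_level": 1},
    "output": {"reasoning": "destructive request", "commands": []},
}
untagged = {"output": {"reasoning": "list files", "commands": ["ls -la"]}}

print(build_response(tagged))
print(build_response(untagged))
```

The untagged example silently receives `risk_level` 3. Since the commit message says all 644 examples are tagged, the default should not trigger on the current dataset, but it could mask a missing tag in examples added later. Note also that an example with `"metadata": None` would raise `AttributeError`, because the `{}` default only applies when the key is absent.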