data/ingest_audit.py:
- Pulls training audit logs from CT 644 (dev + prod)
- Filters: language mismatch (Chinese output for English input), system
prompt leaks, empty responses, duplicates
- Keeps multilingual examples where input/output languages match
- Converts to dataset schema, appends to seed_dataset.jsonl
- --dry-run to preview, --source dev/prod/both
Tested: 237 entries → 112 kept (16 lang mismatch, 10 prompt leak, 86 dupe, 13 empty dropped)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>