Seth 13debc8a59 Add audit log ingestion pipeline with language/leak filtering
data/ingest_audit.py:
- Pulls training audit logs from CT 644 (dev + prod)
- Filters out: language mismatch (Chinese output for English input), system
  prompt leaks, empty responses, duplicates
- Keeps multilingual examples where input/output languages match
- Converts to dataset schema, appends to seed_dataset.jsonl
- Flags: --dry-run to preview, --source dev/prod/both to select the log source
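
The two flags listed above could be wired up roughly as follows. This is a minimal sketch, not the actual contents of data/ingest_audit.py: the helper names `build_parser` and `sources_for`, the help strings, and the default of `both` are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a CLI parser exposing the two flags named in the commit message."""
    parser = argparse.ArgumentParser(
        prog="ingest_audit.py",
        description="Ingest audit logs into the seed dataset (sketch).",
    )
    parser.add_argument(
        "--source",
        choices=["dev", "prod", "both"],
        default="both",  # assumed default; the commit does not state one
        help="which audit log source to pull from",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="preview what would be kept without appending to the dataset",
    )
    return parser


def sources_for(choice):
    """Expand the --source value into the list of sources to read."""
    return ["dev", "prod"] if choice == "both" else [choice]


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"sources={sources_for(args.source)} dry_run={args.dry_run}")
```

Using `choices` lets argparse reject any value other than dev, prod, or both before the pipeline starts.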

Tested: 237 entries → 112 kept, 125 dropped (16 lang mismatch, 10 prompt leak, 86 duplicate, 13 empty)
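
The filtering and the per-reason drop tally above can be sketched as below. This is an illustration under stated assumptions, not the code in data/ingest_audit.py: the `input`/`output` field names, the `LEAK_MARKERS` phrases, the order of the checks, and the crude script-based language check (any CJK ideograph versus none) are all hypothetical stand-ins for whatever the real pipeline uses.

```python
import re
from collections import Counter

# Matches CJK unified ideographs; a crude stand-in for real language detection.
CJK = re.compile(r"[\u4e00-\u9fff]")

# Hypothetical marker phrases; the commit does not show the real leak check.
LEAK_MARKERS = ("system prompt", "my instructions are")


def script_of(text):
    """Return 'cjk' if the text contains any CJK ideograph, else 'other'."""
    return "cjk" if CJK.search(text) else "other"


def classify(entry, seen):
    """Return a drop reason for the entry, or None if it should be kept."""
    user = entry.get("input", "")
    reply = entry.get("output", "")
    if not reply.strip():
        return "empty"
    key = (user.strip().lower(), reply.strip().lower())
    if key in seen:
        return "dupe"
    seen.add(key)
    if any(marker in reply.lower() for marker in LEAK_MARKERS):
        return "prompt_leak"
    # Multilingual pairs are kept as long as input and output scripts match.
    if script_of(user) != script_of(reply):
        return "lang_mismatch"
    return None


def filter_entries(entries):
    """Split entries into kept ones and a Counter of drop reasons."""
    seen, kept, dropped = set(), [], Counter()
    for entry in entries:
        reason = classify(entry, seen)
        if reason is None:
            kept.append(entry)
        else:
            dropped[reason] += 1
    return kept, dropped


sample = [
    {"input": "make it rain", "output": "The heavens open."},
    {"input": "make it rain", "output": "The heavens open."},
    {"input": "hello", "output": "你好"},
    {"input": "你好", "output": "你好,凡人"},
    {"input": "who are you", "output": ""},
    {"input": "rules?", "output": "My system prompt says to stay in character."},
]
kept, dropped = filter_entries(sample)
print(len(kept), dict(dropped))
```

The counts reported in the commit follow the same bookkeeping: kept plus the sum of all drop reasons equals the number of entries read (112 + 16 + 10 + 86 + 13 = 237).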

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 17:58:52 -04:00
Description
An open-source AI God for Minecraft servers — trained to understand natural language, execute commands, and play a divine character.
14 MiB
Languages
Python 83.5%
JavaScript 10.5%
HTML 4.4%
Shell 1.6%