GPU scheduler, 14-tool architecture, plugin deployment, event dispatcher

GPU Scheduler (gpu.sethpc.xyz):
- Live dashboard with 4 GPUs, training monitor, loss sparklines
- Preset-based job scheduler with 3 triggers (time, finish_training, cost)
- Model selection per GPU, pipeline configuration
- Tool self-play and training pipeline types
- Behind Google OAuth, live-refresh without page reload
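
One way to picture the three preset triggers listed above is a single predicate evaluated against a snapshot of scheduler state. This is only a sketch: `Preset`, `SchedulerState`, `should_fire`, and every field name are hypothetical, and the reading of the `cost` trigger as "fire once the hourly cost is at or below a threshold" is an assumption, not something this commit confirms.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SchedulerState:
    """Snapshot the scheduler evaluates triggers against (hypothetical)."""
    now: datetime
    training_running: bool
    cost_per_hour: float


@dataclass
class Preset:
    """A job preset; field names are illustrative, not the real schema."""
    name: str
    trigger: str  # "time", "finish_training", or "cost"
    start_at: Optional[datetime] = None
    max_cost_per_hour: Optional[float] = None


def should_fire(preset: Preset, state: SchedulerState) -> bool:
    """Return True when the preset's trigger condition holds for this state."""
    if preset.trigger == "time":
        # Fire once the configured start time has passed.
        return preset.start_at is not None and state.now >= preset.start_at
    if preset.trigger == "finish_training":
        # Fire as soon as no training run is active.
        return not state.training_running
    if preset.trigger == "cost":
        # Assumed semantics: fire while the hourly cost is within budget.
        return (
            preset.max_cost_per_hour is not None
            and state.cost_per_hour <= preset.max_cost_per_hour
        )
    raise ValueError(f"unknown trigger: {preset.trigger!r}")
```

Keeping the check a pure function of `(preset, state)` makes each trigger easy to test without a live GPU or dashboard.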

Tool Architecture (14 tools):
- 3 new tools: world.nearby_entities, memory.read, memory.write
- 7 script.* tools: write, validate, execute, read, list, delete, schedule
- ScriptManager: full mcfunction datapack CRUD with RCON validation
- Training data: 1,430 tool examples (up from 1,159)

Plugin Deployment (paper-ai-25567):
- WorldGuard 7.0.12, CoreProtect CE 23.1, EssentialsX 2.21.2, Vault 1.7.3
- Fresh greenfield world reset
- 104 RCON-validated plugin training examples

Event Dispatcher:
- Watches server log for deaths, joins, advancements, PvP kills
- Configurable trigger probability and cooldowns per event type
- Deployed to dev server, fires god_system prompts on events
- 21 event-response training examples
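
The per-event trigger probability and cooldown described above could be gated roughly as follows. This is a minimal sketch, not the dispatcher's actual code: `EventRule`, `EventGate`, and the injectable `clock` and `rng` parameters are hypothetical names introduced for illustration.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class EventRule:
    probability: float  # chance in [0, 1] that a matching event fires
    cooldown_s: float   # minimum seconds between fires for this event type


class EventGate:
    """Decides whether a detected event should trigger a prompt."""

    def __init__(
        self,
        rules: Dict[str, EventRule],
        clock: Callable[[], float] = time.monotonic,
        rng: Callable[[], float] = random.random,
    ) -> None:
        self.rules = rules
        self.clock = clock
        self.rng = rng
        self._last_fired: Dict[str, float] = {}

    def should_fire(self, event_type: str) -> bool:
        rule: Optional[EventRule] = self.rules.get(event_type)
        if rule is None:
            return False  # unconfigured event types never fire
        now = self.clock()
        last = self._last_fired.get(event_type)
        if last is not None and now - last < rule.cooldown_s:
            return False  # still cooling down
        if self.rng() >= rule.probability:
            return False  # lost the probability roll
        self._last_fired[event_type] = now
        return True
```

The cooldown is checked before the probability roll, and only a successful fire resets it, so a skipped event does not delay the next one.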

Training Infrastructure:
- train_lora.py: --save-steps (default 50) saves a checkpoint every N steps; --resume continues from the latest checkpoint
- run_training.sh: stops Ollama, activates conda, restarts Ollama afterwards
- Passwordless sudo for ollama services on steel141
- Dev server added to MCSManager with autoStart

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mortdecai
2026-03-21 03:14:45 -04:00
parent 434589d098
commit da8f557219
34 changed files with 7822 additions and 2 deletions
+16 -2
@@ -157,6 +157,8 @@ def main():
     parser.add_argument("--grad-accum", type=int, default=4, help="Gradient accumulation steps")
     parser.add_argument("--max-seq-len", type=int, default=2048, help="Max sequence length")
     parser.add_argument("--dry-run", action="store_true", help="Load model and dataset but don't train")
+    parser.add_argument("--save-steps", type=int, default=50, help="Save checkpoint every N steps")
+    parser.add_argument("--resume", action="store_true", help="Resume from latest checkpoint if available")
     args = parser.parse_args()
     # Auto-detect paths
@@ -258,13 +260,25 @@ def main():
         weight_decay=0.01,
         bf16=True,
         logging_steps=1,
-        save_strategy="epoch",
+        save_strategy="steps",
+        save_steps=args.save_steps,
         save_total_limit=3,
         seed=42,
         max_seq_length=args.max_seq_len,
         dataset_text_field="text",
         packing=True,
     )
+    # Check for resume checkpoint
+    resume_ckpt = None
+    if args.resume:
+        ckpt_dir = Path(args.output)
+        if ckpt_dir.exists():
+            checkpoints = sorted(ckpt_dir.glob("checkpoint-*"), key=lambda p: int(p.name.split("-")[-1]))
+            if checkpoints:
+                resume_ckpt = str(checkpoints[-1])
+                print(f" Resuming from: {resume_ckpt}")
     # Train
     print(f"\nStarting training ({args.epochs} epochs, {len(train_data)} examples)...")
     trainer = SFTTrainer(
@@ -274,7 +288,7 @@ def main():
         args=training_args,
     )
-    trainer.train()
+    trainer.train(resume_from_checkpoint=resume_ckpt)
     # Save adapter
     print(f"\nSaving LoRA adapter to {args.output}...")
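
The checkpoint lookup in the diff sorts `checkpoint-*` directories by their numeric step suffix. The standalone sketch below shows the same idea with two extra guards. It is an illustration under assumptions, not part of this commit: `find_latest_checkpoint` is a hypothetical helper name, and it skips any entry whose suffix is not an integer (for example a directory named `checkpoint-best`), which `int()` in the inline version would reject with a `ValueError`.

```python
import re
from pathlib import Path
from typing import Optional, Union

# Matches directory names such as "checkpoint-150" and captures the step count.
_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: Union[str, Path]) -> Optional[str]:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step = -1
    best_path: Optional[Path] = None
    for entry in root.iterdir():
        match = _CKPT_RE.match(entry.name)
        if match is None or not entry.is_dir():
            continue  # ignore files and non-numeric suffixes
        step = int(match.group(1))
        if step > best_step:
            best_step, best_path = step, entry
    return str(best_path) if best_path is not None else None
```

Comparing steps as integers matters because a plain string sort would place `checkpoint-100` before `checkpoint-50`. A `None` result can be passed straight to `trainer.train(resume_from_checkpoint=...)`, which then starts from scratch.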