GPU scheduler, 14-tool architecture, plugin deployment, event dispatcher

GPU Scheduler (gpu.sethpc.xyz):
- Live dashboard with 4 GPUs, training monitor, loss sparklines
- Preset-based job scheduler with 3 triggers (time, finish_training, cost)
- Model selection per GPU, pipeline configuration
- Tool self-play and training pipeline types
- Behind Google OAuth, live-refresh without page reload

Tool Architecture (14 tools):
- 3 new tools: world.nearby_entities, memory.read, memory.write
- 7 script.* tools: write, validate, execute, read, list, delete, schedule
- ScriptManager: full mcfunction datapack CRUD with RCON validation
- Training data: 1,430 tool examples (up from 1,159)

Plugin Deployment (paper-ai-25567):
- WorldGuard 7.0.12, CoreProtect CE 23.1, EssentialsX 2.21.2, Vault 1.7.3
- Fresh greenfield world reset
- 104 RCON-validated plugin training examples

Event Dispatcher:
- Watches server log for deaths, joins, advancements, PvP kills
- Configurable trigger probability and cooldowns per event type
- Deployed to dev server, fires god_system prompts on events
- 21 event-response training examples

Training Infrastructure:
- train_lora.py: --save-steps 50, --resume from checkpoint
- run_training.sh: stops Ollama, activates conda, restarts after
- Passwordless sudo for ollama services on steel141
- Dev server added to MCSManager with autoStart

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 03:14:45 -04:00
parent 434589d098
commit da8f557219
34 changed files with 7822 additions and 2 deletions
@@ -157,6 +157,8 @@ def main():
    parser.add_argument("--grad-accum", type=int, default=4, help="Gradient accumulation steps")
    parser.add_argument("--max-seq-len", type=int, default=2048, help="Max sequence length")
    parser.add_argument("--dry-run", action="store_true", help="Load model and dataset but don't train")
+    parser.add_argument("--save-steps", type=int, default=50, help="Save checkpoint every N steps")
+    parser.add_argument("--resume", action="store_true", help="Resume from latest checkpoint if available")
    args = parser.parse_args()

    # Auto-detect paths
@@ -258,13 +260,25 @@ def main():
        weight_decay=0.01,
        bf16=True,
        logging_steps=1,
-        save_strategy="epoch",
+        save_strategy="steps",
+        save_steps=args.save_steps,
+        save_total_limit=3,
        seed=42,
        max_seq_length=args.max_seq_len,
        dataset_text_field="text",
        packing=True,
    )

+    # Check for resume checkpoint
+    resume_ckpt = None
+    if args.resume:
+        ckpt_dir = Path(args.output)
+        if ckpt_dir.exists():
+            checkpoints = sorted(ckpt_dir.glob("checkpoint-*"), key=lambda p: int(p.name.split("-")[-1]))
+            if checkpoints:
+                resume_ckpt = str(checkpoints[-1])
+                print(f"  Resuming from: {resume_ckpt}")
+
    # Train
    print(f"\nStarting training ({args.epochs} epochs, {len(train_data)} examples)...")
    trainer = SFTTrainer(
@@ -274,7 +288,7 @@ def main():
        args=training_args,
    )

-    trainer.train()
+    trainer.train(resume_from_checkpoint=resume_ckpt)

    # Save adapter
    print(f"\nSaving LoRA adapter to {args.output}...")
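
Note on the resume logic added in this diff: the checkpoint lookup sorts `checkpoint-*` directories by their numeric suffix, which matters because a plain string sort would rank `checkpoint-100` before `checkpoint-50`. The sketch below isolates that lookup as a standalone function. The helper name `find_latest_checkpoint` is hypothetical and not part of the commit, and the regex filter is an added assumption: it skips entries whose suffix is not an integer (for example a hand-made `checkpoint-final`), which the inline `int(...)` key in the diff would reject with a `ValueError`.

```python
import re
from pathlib import Path
from typing import Optional

# Matches directory names such as "checkpoint-150"; anything else is ignored.
_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint directory, or None.

    Hypothetical helper illustrating the lookup in the diff above; it is
    not the code that was committed.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None

    numbered = []
    for path in root.iterdir():
        match = _CKPT_RE.match(path.name)
        if match and path.is_dir():
            # Compare by integer step so 1000 ranks above 50.
            numbered.append((int(match.group(1)), path))

    if not numbered:
        return None
    return str(max(numbered, key=lambda item: item[0])[1])
```

With such a helper, the call site could read `trainer.train(resume_from_checkpoint=find_latest_checkpoint(args.output))`; when it returns `None`, training starts from scratch, as it does in the diff when `--resume` is not given or no checkpoint is found. Because `save_total_limit=3` is set, at most three checkpoint directories should be present at any time.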