GPU scheduler, 14-tool architecture, plugin deployment, event dispatcher

GPU Scheduler (gpu.sethpc.xyz):
- Live dashboard with 4 GPUs, training monitor, loss sparklines
- Preset-based job scheduler with 3 triggers (time, finish_training, cost)
- Model selection per GPU, pipeline configuration
- Tool self-play and training pipeline types
- Behind Google OAuth, live-refresh without page reload

Tool Architecture (14 tools):
- 3 new tools: world.nearby_entities, memory.read, memory.write
- 7 script.* tools: write, validate, execute, read, list, delete, schedule
- ScriptManager: full mcfunction datapack CRUD with RCON validation
- Training data: 1,430 tool examples (up from 1,159)

Plugin Deployment (paper-ai-25567):
- WorldGuard 7.0.12, CoreProtect CE 23.1, EssentialsX 2.21.2, Vault 1.7.3
- Fresh greenfield world reset
- 104 RCON-validated plugin training examples

Event Dispatcher:
- Watches server log for deaths, joins, advancements, PvP kills
- Configurable trigger probability and cooldowns per event type
- Deployed to dev server, fires god_system prompts on events
- 21 event-response training examples

Training Infrastructure:
- train_lora.py: --save-steps 50, --resume from checkpoint
- run_training.sh: stops Ollama, activates conda, restarts after
- Passwordless sudo for ollama services on steel141
- Dev server added to MCSManager with autoStart

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
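Review note: the Event Dispatcher bullets mention a configurable trigger probability and cooldown per event type, but the dispatcher code is not part of this diff. One way such gating might look is sketched below. `EventRule`, `EventGate` and `should_fire` are hypothetical names chosen for illustration, not identifiers from this repository, and the real dispatcher may order or combine the two checks differently.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class EventRule:
    """Hypothetical per-event-type settings."""
    probability: float  # chance in [0, 1] that an eligible event fires
    cooldown_s: float   # minimum seconds between two firings of this type


@dataclass
class EventGate:
    """Decides whether a log event should trigger a prompt (illustrative only)."""
    rules: Dict[str, EventRule]
    clock: Callable[[], float] = time.monotonic
    rng: random.Random = field(default_factory=random.Random)
    _last_fired: Dict[str, float] = field(default_factory=dict)

    def should_fire(self, event_type: str) -> bool:
        rule: Optional[EventRule] = self.rules.get(event_type)
        if rule is None:
            return False  # unknown event types never fire

        now = self.clock()
        last = self._last_fired.get(event_type)
        if last is not None and now - last < rule.cooldown_s:
            return False  # still cooling down for this event type

        if self.rng.random() >= rule.probability:
            return False  # lost the probability roll; cooldown is not started

        self._last_fired[event_type] = now
        return True
```

In this sketch the cooldown is checked before the probability roll, and a lost roll does not start a cooldown, so a rare event type is not suppressed further by rolls that never fired. Injecting `clock` and `rng` keeps the gate testable without sleeping.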
@@ -157,6 +157,8 @@ def main():
     parser.add_argument("--grad-accum", type=int, default=4, help="Gradient accumulation steps")
     parser.add_argument("--max-seq-len", type=int, default=2048, help="Max sequence length")
     parser.add_argument("--dry-run", action="store_true", help="Load model and dataset but don't train")
+    parser.add_argument("--save-steps", type=int, default=50, help="Save checkpoint every N steps")
+    parser.add_argument("--resume", action="store_true", help="Resume from latest checkpoint if available")
     args = parser.parse_args()
 
     # Auto-detect paths
@@ -258,13 +260,25 @@ def main():
         weight_decay=0.01,
         bf16=True,
         logging_steps=1,
-        save_strategy="epoch",
+        save_strategy="steps",
+        save_steps=args.save_steps,
+        save_total_limit=3,
         seed=42,
         max_seq_length=args.max_seq_len,
         dataset_text_field="text",
         packing=True,
     )
 
+    # Check for resume checkpoint
+    resume_ckpt = None
+    if args.resume:
+        ckpt_dir = Path(args.output)
+        if ckpt_dir.exists():
+            checkpoints = sorted(ckpt_dir.glob("checkpoint-*"), key=lambda p: int(p.name.split("-")[-1]))
+            if checkpoints:
+                resume_ckpt = str(checkpoints[-1])
+                print(f" Resuming from: {resume_ckpt}")
+
     # Train
     print(f"\nStarting training ({args.epochs} epochs, {len(train_data)} examples)...")
     trainer = SFTTrainer(
@@ -274,7 +288,7 @@ def main():
         args=training_args,
     )
 
-    trainer.train()
+    trainer.train(resume_from_checkpoint=resume_ckpt)
 
     # Save adapter
     print(f"\nSaving LoRA adapter to {args.output}...")
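
Review note: the checkpoint lookup added in the second hunk sorts `checkpoint-*` entries by their numeric suffix, which matters because a plain string sort would rank `checkpoint-50` after `checkpoint-100`. The inline `int(...)` key would raise `ValueError` if the output directory ever contained a non-numeric entry such as `checkpoint-best`. A more defensive standalone version might look like the sketch below. `find_latest_checkpoint` is a hypothetical helper name, not something defined in this commit.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple, Union

# Matches directory names of the form "checkpoint-<step>", e.g. "checkpoint-150".
_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: Union[str, Path]) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint directory, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    numbered: List[Tuple[int, Path]] = []
    for entry in root.iterdir():
        match = _CKPT_RE.match(entry.name)
        # Skip files and names without a purely numeric step suffix.
        if match and entry.is_dir():
            numbered.append((int(match.group(1)), entry))

    if not numbered:
        return None

    # Compare by step number, not by string, so 100 ranks above 50.
    _, latest = max(numbered, key=lambda pair: pair[0])
    return str(latest)
```

With a helper like this, the call site could become `trainer.train(resume_from_checkpoint=find_latest_checkpoint(args.output) if args.resume else None)`. Passing `None` starts a fresh run, which matches the behavior of the diff when no checkpoint is found.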