Seth 9d789d2524 Three-tier constraint model, mode-aware eval, boundary examples, playtest tooling

Eval harness:
- Mode-aware scoring: sudo=strict (exact match), pray/god=soft (category match,
  in-character, appropriate intensity)
- New metrics: cmd_category_match, appropriate_intensity, scoring_mode breakdown
- Eval defaults to steel141 (192.168.0.141) — prod GPU reserved for serving

Dataset (213 examples):
- Added 31 boundary/adversarial examples (safety edges, abstention, near-boundary)
- Updated pray example reasoning: character-driven logic, not prescriptive outputs
- Tagged pray examples with scoring_mode=soft

Playtest tooling:
- whitelist.sh: add/remove/list across all 3 servers
- FRIENDS_INVITE.md + Discord version: playtester recruitment docs
- Server addresses and implementation details for both training servers

PLAN.md:
- Three-tier constraint model documented (sudo/pray/god_system)
- Success criteria split by scoring mode
- All session decisions logged

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-18 15:57:01 -04:00

5.4 KiB


Playtest My Minecraft AI — I Need Your Bad Ideas

Hey — I built something for my Minecraft server that I think you'll get a kick out of, and I need people to come break it.

What It Is

I have an AI running on my server that listens to in-game chat and does things in the world based on what you say. Two modes:

sudo <anything> — Talk to the server in plain English. "sudo give me a diamond sword with sharpness 5" and it just... does it. "sudo build a house here" and it places blocks. "sudo kill all the zombies" and they die. It translates whatever you type into real server commands and runs them live.

pray <anything> — Talk to God. Literally. There's an AI character playing God on the server. Pray for items, pray for smiting your enemies, pray something offensive and get punished. It responds in-character with dramatic messages and then actually grants or denies your request with real effects, items, lightning bolts, whatever it decides.

There's also bug_log <what happened> — if something goes wrong or doesn't do what you expected, type that and it captures the whole interaction so I can fix it.
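
For the curious, here is a rough sketch of how chat triggers like these could be picked out of a message. It is illustrative only and not the server's actual code: the function name `parse_trigger` and the rule that a bare keyword with nothing after it gets ignored are assumptions.

```python
from typing import Optional, Tuple

# The three prefixes come from the description above; everything else is assumed.
TRIGGERS = ("sudo", "pray", "bug_log")


def parse_trigger(message: str) -> Optional[Tuple[str, str]]:
    """Return (trigger, request) if the chat message starts with a known trigger."""
    parts = message.strip().split(maxsplit=1)
    if len(parts) < 2:
        return None  # a bare "sudo" with no request is ignored in this sketch
    keyword, request = parts[0].lower(), parts[1].strip()
    if keyword in TRIGGERS and request:
        return keyword, request
    return None  # ordinary chat, nothing for the AI to do


if __name__ == "__main__":
    print(parse_trigger("sudo give me a diamond sword with sharpness 5"))
    print(parse_trigger("hello everyone"))
```

Anything that does not start with one of the three keywords is treated as normal chat and left alone.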

What's Actually Happening Under the Hood

The AI is a small open-source language model (7 billion parameters) running on a GPU in my server closet. No cloud, no OpenAI, no API costs — it's all local hardware. The model reads your chat message, figures out what Minecraft commands would accomplish what you asked for, and the server executes them. There's a safety layer that blocks dangerous stuff (it won't /stop the server or /op anyone, even if you ask nicely).
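
A safety layer like that could be as simple as a blocklist check on each generated command before it runs. The sketch below is a guess at the general shape, not the real guardrail code: only `/stop` and `/op` are named above, and `BLOCKED_COMMANDS`, `is_allowed`, and `filter_commands` are hypothetical names.

```python
# Only "stop" and "op" come from the text; a real blocklist is likely longer.
BLOCKED_COMMANDS = {"stop", "op"}


def is_allowed(command: str) -> bool:
    """Return False if the command (or an /execute ... run target) is blocked."""
    words = command.strip().lstrip("/").lower().split()
    if not words:
        return False  # nothing to run
    # Check the leading command plus any word that follows "run",
    # so "execute as @a run op Steve" is caught as well.
    candidates = [words[0]] + [
        words[i + 1] for i, w in enumerate(words[:-1]) if w == "run"
    ]
    for name in candidates:
        # Strip a stray slash and a namespace such as "minecraft:op".
        name = name.lstrip("/").split(":", 1)[-1]
        if name in BLOCKED_COMMANDS:
            return False
    return True


def filter_commands(commands):
    """Keep only the commands that pass the blocklist check."""
    return [c for c in commands if is_allowed(c)]


if __name__ == "__main__":
    print(filter_commands(["/op Steve", "/time set day"]))
```

The check is deliberately conservative: it will also reject a harmless command that happens to contain "run op", which seems like the right trade-off for a filter of this kind.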

The interesting part: the model isn't great yet. It gets maybe 60-75% of requests right on the first try. It sometimes uses outdated command syntax, hallucinates item names that don't exist, or just doesn't understand what you want. That's where you come in.

Why I Need You

I'm building a training dataset to fine-tune the model so it actually gets good at this. Every interaction you have — every sudo command, every prayer, every bug report — gets logged as a structured training example. The more variety I get, the better the model becomes.
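
To give a feel for what "structured training example" might mean, here is a minimal sketch that writes one interaction per line as JSON. The field names (`timestamp`, `player`, `mode`, `input`, `commands`) and the JSON Lines layout are assumptions for illustration; the real log schema may look quite different.

```python
import json
from datetime import datetime, timezone


def make_record(player, mode, message, commands):
    """Build one hypothetical training record for a single interaction."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "player": player,
        "mode": mode,          # "sudo", "pray", or "bug_log"
        "input": message,      # what the player typed after the keyword
        "commands": list(commands),  # what the model decided to run
    }


def append_record(path, record):
    """Append the record as one JSON line, so the file stays easy to stream."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    rec = make_record("Steve", "sudo", "make it daytime", ["time set day"])
    print(json.dumps(rec, indent=2))
```

One record per line means new interactions can be appended as they happen without rewriting the file.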

What I can't do is generate this data myself. I've been writing test cases for weeks and I'm out of ideas for weird things to ask. I need real people who will:

- Ask for things I'd never think of
- Phrase requests in ways I wouldn't
- Try to confuse it, trick it, or find edge cases
- Actually play the game and use it organically, not just run a test script

You don't need to do anything special. Just play Minecraft and talk to the AI when you feel like it. The logging happens automatically.

The Servers

Both are Java Edition 1.21.x, whitelisted, always up. They run different AI implementations so I'm collecting data from both.

sethpc.xyz:25567 — Paper AI Server (the full experience)

Paper server with the complete AI stack. This is the main training server.

- pray and sudo both work for all players
- LangGraph session gateway — the AI can use tools (wiki lookups, web search) mid-conversation
- FastAsyncWorldEdit for building commands
- Divine interventions on a random timer — God will occasionally just... do things
- Prayer memory — God remembers your previous prayers and holds grudges
- Full training audit logging — every interaction is captured as structured data

sethpc.xyz:25566 — Shrink World (the challenge server)

Vanilla survival with a twist: the world border shrinks every time someone dies, and creeper spawns are 5x. Hard difficulty.

- pray works, sudo is admin-only here
- Simpler AI implementation — no gateway, no tools, no templates
- Same God persona but less capable (fewer max commands, shorter context)
- Starter kit on first join
- This one is more about playing the game and using pray organically when you need help surviving

The Paper server is where I need the most data, but the shrink server gives me a different kind of interaction — players praying under pressure when they're actually in trouble, not just testing.

What You Need

- Minecraft Java Edition
- Your username so I can whitelist you

DM me your Minecraft username and I'll add you.

If You Want to Nerd Out

The whole project is on my Gitea. The training pipeline, the evaluation harness, the bake-off results comparing different models — it's all there.

Main project (private, ask for access): https://git.sethpc.xyz/Seth/Minecraft-AI-model
- 182 training examples and counting
- Eval harness that scores models on command accuracy, syntax, safety compliance
- Live bake-off tool that runs commands on the actual server and compares results
The AI God service (private, ask for access): https://git.sethpc.xyz/Seth/minecraft-ai-god-paper-fork
- ~3800 lines of Python — log tailing, RCON execution, LLM integration, safety guardrails
- Prayer memory so God remembers what you said
- Automatic syntax repair for common model mistakes
- Divine interventions on a random timer (you'll see)
Model bake-off results (public): https://git.sethpc.xyz/Seth/small-llm-bakeoff
- Tested 7 models from 3.8B to 30B parameters on the same tasks
- The 7B model beat the 30B model on every metric
- Found a bug where Qwen models were using all their tokens thinking and returning empty answers

TL;DR

Come play Minecraft on my server. Talk to the AI. Try to break it. Every weird thing you ask it makes the model better. DM me your username.
