Mortdecai eecebe7ef5 docs: add canonical tooling corpus (147 files) from Google/HF/frameworks
Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
  gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
  HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
  tokenizer_config.json, transformers gemma4/ source, launch blog posts,
  official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
  comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
  Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
  TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00


Recommended Gemma 4 Fine-Tuning Recipe (Seth's Homelab)

TL;DR

Use Unsloth. Rent a single H100 on Vast.ai. Fine-tune Gemma 4 E4B (or 31B QLoRA). Save GGUF. ollama create back to CT 105.

Why not the alternatives:

  • Your 3090 Ti(s): can handle E2B/E4B LoRA comfortably, but 26B A4B LoRA wants >40 GB and 31B QLoRA wants 22 GB (fits in 24 GB, tightly). Axolotl's 5090-validated configs need Flex Attention to fit, and you lose half the throughput. An H100 at $2-3/hr for 3-4 hours is cheaper than the time you'll spend tuning memory.
  • Axolotl is great — in particular the 26B MoE ScatterMoE+expert-LoRA config is genuinely novel and Unsloth doesn't match it. But Axolotl has more moving parts (FSDP, kernels, flex attention), breaks more subtly on config errors, and the docs are less Gemma-4-specific than Unsloth's.
  • TRL has no Gemma-4-specific SFT script yet — you'd be porting sft_gemma3.py. Useful if you need DPO/GRPO or multimodal tool-call GRPO (the CARLA recipe), but heavier lift than Unsloth for plain SFT.
  • Google cookbook works and is authoritative but is slower than Unsloth (no fused kernels) and the notebook format is noisier to modify.
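
The rental cost implied by the first bullet is easy to bound. A minimal sketch, using only the hourly rate and run length quoted above; the function name is illustrative, and actual Vast.ai pricing and run time will vary.

```python
def rental_cost_range(rate_low, rate_high, hours_low, hours_high):
    """Return the (min, max) total cost in dollars for a rented GPU."""
    return rate_low * hours_low, rate_high * hours_high

# $2-3/hr for 3-4 hours, as quoted in the bullet above.
low, high = rental_cost_range(2.0, 3.0, 3, 4)
print(f"Estimated H100 rental: ${low:.0f}-${high:.0f}")
```

With these inputs the bound works out to roughly $6-12 per run.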

Exact command

On a rented H100 (Vast.ai vast-h100 alias, already configured)

ssh vast-h100
# one-time setup
pip install unsloth "trl==0.22.2" "transformers>=5.5.0" "peft>=0.15" timm torchcodec

Training script (save as finetune_gemma4.py on the H100):

from unsloth import FastModel
from unsloth.chat_templates import get_chat_template, standardize_data_formats, train_on_responses_only
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

MODEL = "unsloth/gemma-4-E4B-it"     # swap to "unsloth/gemma-4-31B-it" if you want more headroom
DATASET = "YOUR_DATASET_HERE"         # e.g. a mortdecai-style chat JSONL on HF Hub

# 1. Load model + tokenizer in 4-bit
model, tokenizer = FastModel.from_pretrained(
    model_name = MODEL,
    max_seq_length = 4096,
    load_in_4bit = True,
    full_finetuning = False,
)

# 2. Attach LoRA
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers = False,   # text-only FT
    finetune_language_layers = True,
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

# 3. Chat template — "gemma-4" (literal, with dash)
tokenizer = get_chat_template(tokenizer, chat_template = "gemma-4")

# 4. Dataset: expects ShareGPT-style `conversations` field with {from, value}
#    OR OpenAI-style `messages` with {role, content} — standardize_data_formats handles both.
dataset = load_dataset(DATASET, split = "train")
dataset = standardize_data_formats(dataset)

def fmt(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=False)
            .removeprefix('<bos>')     # critical: avoid double <bos>
        for c in convos
    ]
    return {"text": texts}
dataset = dataset.map(fmt, batched=True)

# 5. Train
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
        output_dir = "outputs",
    ),
)

# 6. Mask everything except assistant turns
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|turn>user\n",
    response_part    = "<|turn>model\n",
)

trainer.train()

# 7. Save merged 16-bit for GGUF conversion
model.save_pretrained_merged("merged_out", tokenizer, save_method = "merged_16bit")

# 8. Alternatively, save directly to GGUF (Q4_K_M), Ollama-ready.
#    Steps 7 and 8 are alternatives: comment out the one you don't need.
model.save_pretrained_gguf("gemma4-mortdecai-v1", tokenizer, quantization_method = "q4_k_m")

Run:

python finetune_gemma4.py
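
Step 4 of the script names two row shapes, ShareGPT-style and OpenAI-style. The sketch below shows what each looks like and how one can be mapped onto the other in plain Python. It is an illustration only, not Unsloth's implementation of standardize_data_formats. The speaker-name mapping ("human" to "user", "gpt" to "assistant") is an assumption based on common ShareGPT exports, and the sample rows are made up.

```python
# Assumed ShareGPT speaker names mapped to OpenAI-style roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def to_messages(row):
    """Return a list of {role, content} dicts from either row shape."""
    if "messages" in row:
        # OpenAI-style rows already use {role, content}.
        return [{"role": m["role"], "content": m["content"]} for m in row["messages"]]
    # ShareGPT-style rows use {from, value}; unknown speakers pass through unchanged.
    return [
        {"role": ROLE_MAP.get(t["from"], t["from"]), "content": t["value"]}
        for t in row["conversations"]
    ]

# Hypothetical sample rows, one per format.
sharegpt_row = {"conversations": [
    {"from": "human", "value": "Restart the server."},
    {"from": "gpt", "value": "Restarting now."},
]}
openai_row = {"messages": [
    {"role": "user", "content": "Restart the server."},
    {"role": "assistant", "content": "Restarting now."},
]}

# Both shapes normalize to the same list of turns.
assert to_messages(sharegpt_row) == to_messages(openai_row)
```

Running a check like this over a few rows before renting the GPU is a cheap way to confirm the dataset has the field the fmt function reads.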

Pulling the result back and serving on CT 105

# From your workstation, copy the GGUF back (or upload it to HF Hub from the Vast box):
scp -r vast-h100:~/gemma4-mortdecai-v1*.gguf steel141:/tmp/

# On CT 105 (pve197 Ollama):
cat > Modelfile <<'EOF'
FROM /path/to/gemma4-mortdecai-v1.Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
SYSTEM "You are Mortdecai, a Minecraft ops AI. You are powered by Gemma 4."
EOF
ollama create mortdecai-gemma4:v1 -f Modelfile
ollama run mortdecai-gemma4:v1
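
Once ollama create succeeds, one request against Ollama's HTTP API confirms the model loads and answers. A minimal sketch, assuming Ollama is listening on its default address (localhost:11434) inside CT 105; the prompt is a placeholder and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default Ollama address

def build_request(model, prompt):
    """Build a non-streaming /api/generate request for Ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def smoke_test(model="mortdecai-gemma4:v1", prompt="Say hello in one sentence."):
    """Send one prompt and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Calling print(smoke_test()) on CT 105 should return a short completion; a connection error usually means the Ollama service is not running or is bound to a different address.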

Hardware sizing guide (from Unsloth's verified numbers)

| Variant | LoRA | QLoRA | Full FT | My recommendation |
|---|---|---|---|---|
| E2B | 8-10 GB | 8 GB | ~20 GB | Free Colab T4; local 3090 Ti fine |
| E4B | 17 GB | 10 GB | ~32 GB | Local 3090 Ti (24 GB) tight but fine; H100 faster |
| 26B A4B | >40 GB (16-bit recommended, NOT 4-bit) | not recommended | | H100 80 GB |
| 31B dense | >48 GB | 22 GB | 2×H100 | H100 80 GB or 2×3090 Ti FSDP |

For Mortdecai-style behavior tuning (matches your existing qwen-based setup), start with E4B. It's the sweet spot: more capable than qwen3 8B in the things that matter (Gemma 4 E4B beats Gemma 3 27B on most benchmarks), vision-capable if you want it, and small enough to fit on a single 3090 Ti locally.

For a real coding/reasoning upgrade, use 31B QLoRA on H100. Unsloth's 31B QLoRA notebook is the canonical recipe there.
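
The sizing table can be turned into a quick lookup. The sketch below copies the LoRA and QLoRA figures from the table and lists which combinations fit a given card. Two assumptions: the upper end of the "8-10 GB" range is used for E2B LoRA, and "greater than" entries are treated as strict bounds. Real usage also depends on sequence length and batch size, which this ignores.

```python
# VRAM figures (GB) copied from the sizing table; None means "not recommended".
# The boolean marks a strict "greater than" bound (e.g. ">40 GB").
VRAM_GB = {
    "E2B":       {"lora": (10, False), "qlora": (8, False)},
    "E4B":       {"lora": (17, False), "qlora": (10, False)},
    "26B A4B":   {"lora": (40, True),  "qlora": None},
    "31B dense": {"lora": (48, True),  "qlora": (22, False)},
}

def options_for(vram_gb):
    """List (variant, method) pairs whose stated requirement fits in vram_gb."""
    fits = []
    for variant, methods in VRAM_GB.items():
        for method, need in methods.items():
            if need is None:
                continue  # combination is not recommended at all
            bound, strict = need
            if vram_gb > bound or (vram_gb == bound and not strict):
                fits.append((variant, method))
    return fits

print(options_for(24))  # a single 3090 Ti
print(options_for(80))  # a single H100
```

For a 24 GB card this lists both E2B and E4B methods plus 31B QLoRA, which matches the recommendations in the table.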

Gemma-4-specific pitfalls to NOT miss

  1. New chat template. Gemma 4 uses <|turn>user\n … <turn|> — NOT Gemma 3's <start_of_turn>user\n … <end_of_turn>. Unsloth's get_chat_template(tokenizer, chat_template="gemma-4") handles this; the HF tokenizer's built-in Jinja also handles it if you rely on apply_chat_template. Axolotl uses chat_template: gemma4 (no dash — different key).

  2. 6 new tool-calling tokens. <|tool>, <tool|>, <|tool_call>, <tool_call|>, <|tool_response>, <tool_response|>, plus the string-delimiter <|"|>. If fine-tuning on tool-call data, include full <|tool_call>call:fn_name{args}<tool_call|> in the assistant turn — no role="tool" branch exists.

  3. modules_to_save=["lm_head","embed_tokens"] + ensure_weight_tying=True in LoraConfig if going vanilla PEFT (Google's cookbook does this explicitly). The new special tokens are learned embeddings — if the embed table is frozen, the adapter sees random vectors for them and training silently underperforms. Unsloth and Axolotl bake this in.

  4. Freeze the vision/audio tower by default. Two idioms in the wild:

    • Axolotl: freeze_mm_modules: true + text-only LoRA regex.
    • HF's CARLA example: target_modules="all-linear" + exclude_modules=["vision_tower", "multi_modal_projector"].

    Only train the vision tower if your task specifically needs the encoder to adapt (new image domain). For text-mode fine-tunes like Mortdecai, always freeze.

  5. Flash Attention DOES NOT WORK on Gemma 4. FA2's max head_dim=256, FA4's is 128; Gemma 4's global_head_dim=512 exceeds both. Use SDPA or Flex Attention. Axolotl's configs set sdp_attention: true. TRL's sft_gemma3.py uses attn_implementation="eager", which works but is slow; prefer "sdpa". (Unsloth's FastModel handles this automatically.)

  6. LoRA kernels OFF. Gemma 4's shared-KV-cache layers break the fused LoRA kernels. Axolotl sets lora_mlp_kernel/qkv_kernel/o_kernel: false explicitly. Unsloth's FastModel is fine because it uses its own kernel path that knows about shared-KV.

  7. Don't prepend a second <bos>. apply_chat_template adds one; SFTTrainer's collator adds one; if you don't .removeprefix('<bos>') before passing text to the trainer, you train the model to expect <bos><bos>. Unsloth's example notebooks do this strip — copy their pattern.

  8. 26B A4B: use 16-bit LoRA, not QLoRA. Unsloth's docs explicitly say "MoE QLoRA not recommended, dense 31B is fine." Axolotl has a ScatterMoE+expert-quantized+expert-LoRA config that does make 4-bit work for the MoE (validated on a 5090), but it's the only tool that does — Unsloth's 26B A4B notebook goes 16-bit for quality.

  9. Initial training loss of 13-15 on E2B/E4B is normal, not a bug. Multimodal models start much higher than 5-8. If you see 13-15, don't panic; GOTCHAS.md §"Fine-Tuning Ecosystem Issues" covers this.

  10. mm_token_type_ids required during training even for text-only data. Day-one PEFT/Transformers bug: the multimodal collator requires this field. Pin transformers>=5.5.0 and peft>=0.15 to ensure the fix is present.
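
Pitfalls 1, 2 and 7 can be caught in the data before any GPU time is spent. The sketch below lints one rendered training string for stale Gemma 3 turn markers, a leading <bos>, and unbalanced tool-token pairs. The token strings are the ones listed above; the function name and sample strings are hypothetical, and the exact whitespace between turns in the samples is an assumption.

```python
GEMMA3_MARKERS = ("<start_of_turn>", "<end_of_turn>")
TOOL_TOKEN_PAIRS = (
    ("<|tool>", "<tool|>"),
    ("<|tool_call>", "<tool_call|>"),
    ("<|tool_response>", "<tool_response|>"),
)

def lint_training_text(text):
    """Return a list of human-readable problems found in one rendered example."""
    problems = []
    # Pitfall 1: Gemma 3 turn markers must not appear in Gemma 4 data.
    for marker in GEMMA3_MARKERS:
        if marker in text:
            problems.append(f"stale Gemma 3 marker {marker}")
    # Pitfall 7: the text handed to the trainer should not start with <bos>.
    if text.startswith("<bos>"):
        problems.append("leading <bos> (the trainer will add another)")
    # Pitfall 2: every opening tool token needs its closing partner.
    for open_tok, close_tok in TOOL_TOKEN_PAIRS:
        if text.count(open_tok) != text.count(close_tok):
            problems.append(f"unbalanced {open_tok} / {close_tok}")
    return problems

# Hypothetical samples: one clean, one with several problems.
good = "<|turn>user\nlist players<turn|>\n<|turn>model\n<|tool_call>call:list_players{}<tool_call|><turn|>\n"
bad = "<bos><start_of_turn>user\nhi<end_of_turn>\n<|tool_call>call:list_players{}"

for problem in lint_training_text(bad):
    print(problem)
```

Mapping this over dataset["text"] after the fmt step and failing fast on any non-empty result keeps template mistakes from turning into a silently degraded run.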

Feature parity snapshot (2026-04-18)

Feature Unsloth TRL Axolotl Google cookbook
Text SFT ~ (via gemma3 script, change model_id)
Vision SFT ~ (via sft_vlm_gemma3) ✓ (E2B)
Audio SFT ✓ (E2B/E4B)
GRPO ✓ (E2B + RL game notebooks) ✓ (CARLA VLM-GRPO, official)
DPO via TRL
26B MoE native ✓ (16-bit LoRA) ~ ✓ (ScatterMoE + expert-LoRA, validated on 5090)
31B dense QLoRA ~ ✓ (with Flex Attn) ~
Free Colab T4 path ✓ (E2B) ~ (via Colab Pro)
Multi-GPU FSDP ~ ✓ (first-class) ~

Bottom line: Unsloth has the broadest Gemma-4-native coverage (including audio and RL games, which no one else has). Axolotl has the best 26B MoE story. TRL has the best multimodal-RL story (CARLA). Google cookbook is the reference, not the fast path.

For Seth's stated use case (fine-tune like mortdecai), Unsloth wins on ergonomics + speed + T4 free-tier fallback.