Bake-off tested 7 models on 31 seed examples via GPU-accelerated Ollama on node-197 RTX 4000. gemma3n:e4b leads for serving (80.6% cmd match, 100% safety, 5.9s). qwen3:8b recommended as fine-tuning base (Apache 2.0, best syntax quality, strong ecosystem). Full research in MODEL_RESEARCH.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Model Research: Small LMs for LoRA/QLoRA Fine-Tuning
Date: 2026-03-18
Purpose: Evaluate small language models (4-14B) as base models for the Minecraft server ops assistant.
Constraints:
- 8GB VRAM for inference (Q4 quantized via Ollama)
- 24GB VRAM for training (QLoRA)
- Permissive license (Apache 2.0, MIT -- NOT community/restricted licenses)
- Available on both Ollama (serving) and HuggingFace in safetensors/PyTorch (training)
- Good instruction following and structured JSON output
- Active fine-tuning ecosystem (Unsloth, Axolotl, PEFT, LlamaFactory)
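The hard constraints above (license and the two VRAM budgets) can be checked mechanically. The following is a minimal sketch, not part of the original evaluation: the `Candidate` type and `meets_constraints` helper are hypothetical names, and the figures are the upper ends of the estimates quoted in the per-model tables later in this document.

```python
from dataclasses import dataclass

# Licenses treated as permissive. This set is an assumption that mirrors
# the "Apache 2.0, MIT" constraint stated above.
PERMISSIVE_LICENSES = {"Apache 2.0", "MIT"}


@dataclass(frozen=True)
class Candidate:
    name: str
    license: str
    q4_vram_gb: float     # upper end of the quoted Q4 inference estimate
    qlora_vram_gb: float  # upper end of the quoted QLoRA training estimate


def meets_constraints(c: Candidate,
                      inference_budget_gb: float = 8.0,
                      training_budget_gb: float = 24.0) -> bool:
    """Return True when a candidate passes the license and VRAM checks."""
    return (c.license in PERMISSIVE_LICENSES
            and c.q4_vram_gb <= inference_budget_gb
            and c.qlora_vram_gb <= training_budget_gb)


# Figures copied from the tables in this document (upper bounds of ranges).
candidates = [
    Candidate("Qwen3-8B", "Apache 2.0", 5.5, 16.0),
    Candidate("Qwen3.5-4B", "Apache 2.0", 3.0, 10.0),
    Candidate("Phi-4-mini-instruct", "MIT", 2.5, 10.0),
    Candidate("Gemma 3 4B-IT", "Gemma Terms of Use", 2.5, 10.0),
    Candidate("Gemma 3 12B-IT", "Gemma Terms of Use", 6.6, 20.0),
    Candidate("Mistral NeMo 12B", "Apache 2.0", 7.0, 22.0),
]

passing = [c.name for c in candidates if meets_constraints(c)]
print(passing)
```

Only the license and VRAM constraints are encoded here. Instruction following, JSON quality, and ecosystem strength need measurement or judgment rather than a threshold.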
Ranked Recommendations
1. Qwen3-8B (RECOMMENDED)
| Attribute | Detail |
|---|---|
| Parameters | 8B dense |
| Release | April 2025 |
| License | Apache 2.0 |
| HuggingFace | Qwen/Qwen3-8B -- safetensors, BF16 |
| Ollama | ollama pull qwen3:8b |
| Q4 VRAM | ~5.5 GB (fits 8GB comfortably) |
| QLoRA VRAM | ~14-16 GB (fits 24GB easily) |
| Context | 32K native (128K with YaRN) |
Why #1:
- Outperforms Qwen2.5-14B on benchmarks despite being smaller. MMLU-Redux ~87, MATH-500 ~98.
- Apache 2.0 with no usage restrictions -- the cleanest license in this list.
- First-class Unsloth support with dedicated notebooks and 2x training speedup.
- Supported by Axolotl, LlamaFactory, PEFT, and TRL out of the box.
- Native thinking/non-thinking mode toggle -- useful for complex command generation vs. quick lookups.
- Strong structured output support; JSON format instructions work reliably.
- Massive community: most fine-tuned derivatives on HuggingFace of any model this size.
Caveats:
- Newer than some alternatives, so fewer battle-tested fine-tunes in production.
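To make the structured-output and thinking-toggle points concrete, here is a rough sketch of a request against Ollama's `/api/generate` endpoint using only the standard library. The `command` field and the `build_request`, `parse_command`, and `ask` helpers are hypothetical, chosen for illustration; the endpoint URL assumes a default local Ollama install. The `/think` and `/no_think` suffixes are Qwen3's soft switches. How thinking output interacts with `"format": "json"` may depend on the Ollama version, so treat that combination as untested.

```python
import json
import urllib.request

# Default local Ollama endpoint (assumption: Ollama runs on this machine).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, thinking: bool = False) -> dict:
    """Build an /api/generate payload that asks for a JSON-formatted reply."""
    # Qwen3 reads "/think" or "/no_think" in the prompt as a soft switch.
    switch = "/think" if thinking else "/no_think"
    return {
        "model": "qwen3:8b",
        "prompt": f"{prompt} {switch}",
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
        "stream": False,   # return one response object instead of a stream
    }


def parse_command(raw: str) -> dict:
    """Parse a model reply and check the hypothetical {"command": str} shape."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("command"), str):
        raise ValueError("reply is missing a string 'command' field")
    return data


def ask(prompt: str, thinking: bool = False) -> dict:
    """Send the request to a running Ollama server and validate the reply."""
    body = json.dumps(build_request(prompt, thinking)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return parse_command(json.loads(resp.read())["response"])
```

Separating `build_request` and `parse_command` from the network call keeps the payload and the validation testable without a GPU or a running server.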
2. Qwen3.5-4B
| Attribute | Detail |
|---|---|
| Parameters | 4B dense |
| Release | February 2026 |
| License | Apache 2.0 |
| HuggingFace | Qwen/Qwen3.5-4B -- safetensors, BF16/F32 |
| Ollama | ollama pull qwen3.5:4b (~3.4 GB) |
| Q4 VRAM | ~2.5-3 GB |
| QLoRA VRAM | ~8-10 GB |
| Context | 256K native |
Why #2:
- The newest model on this list (Feb 2026) with latest training techniques.
- Extremely lightweight -- leaves massive headroom for context on 8GB cards.
- 256K context window is best-in-class for this parameter range.
- Full Unsloth + LlamaFactory support confirmed.
- Apache 2.0 license, no restrictions.
- Ideal if your training data is small (<1000 examples) -- smaller models fine-tune faster and can still match larger models on narrow domains.
Caveats:
- 4B may struggle with complex multi-step reasoning compared to 8B.
- Fewer community fine-tunes available yet (very new release).
3. Qwen3-4B
| Attribute | Detail |
|---|---|
| Parameters | 4B dense (36-layer transformer) |
| Release | April 2025 |
| License | Apache 2.0 |
| HuggingFace | Qwen/Qwen3-4B -- safetensors |
| Ollama | ollama pull qwen3:4b |
| Q4 VRAM | ~2.5 GB |
| QLoRA VRAM | ~8-10 GB |
| Context | 32K native (128K with YaRN) |
Why #3:
- Benchmarks rival Qwen2.5-72B-Instruct (!!) according to Qwen team claims.
- MMLU-Redux 83.7, MATH-500 97.0 -- exceptional for 4B.
- Well-established Unsloth support with notebooks and GGUF export pipeline.
- Best fine-tuning benchmark results per distillabs.ai evaluation: "Qwen3-4B-Instruct-2507 delivers the best overall fine-tuned performance, matching a 120B+ teacher."
- Apache 2.0.
Caveats:
- Roughly ten months older than Qwen3.5-4B (April 2025 vs. February 2026); same parameter count but an older architecture.
4. Phi-4-mini-instruct (3.8B)
| Attribute | Detail |
|---|---|
| Parameters | 3.8B |
| Release | February 2025 |
| License | MIT |
| HuggingFace | microsoft/Phi-4-mini-instruct -- safetensors |
| Ollama | ollama pull phi4-mini:3.8b |
| Q4 VRAM | ~2.5 GB |
| QLoRA VRAM | ~8-10 GB |
| Context | 128K |
Why #4:
- MIT license -- the most permissive option available.
- Microsoft provides an official LoRA fine-tuning script in the HuggingFace repo.
- Performance comparable to 7-9B models (Llama-3.1-8B level) despite being 3.8B.
- 200K vocabulary, grouped-query attention -- modern architecture.
- JSON tool-calling format built into the chat template.
- Unsloth support confirmed with dedicated notebooks.
Caveats:
- Smaller community of fine-tuners compared to Qwen.
- 3.8B is the smallest viable option; may need more training data to match larger models on nuanced tasks.
- Microsoft's Phi models have historically had some quirks with non-English content and repetition.
5. Gemma 3 4B-IT
| Attribute | Detail |
|---|---|
| Parameters | 4B (multimodal -- text + image) |
| Release | March 2025 |
| License | Gemma Terms of Use (NOT Apache 2.0 -- see caveats) |
| HuggingFace | google/gemma-3-4b-it -- safetensors |
| Ollama | ollama pull gemma3:4b (~3.3 GB) |
| Q4 VRAM | ~2.5 GB |
| QLoRA VRAM | ~8-10 GB |
| Context | 128K |
Why #5:
- Outperforms Gemma 2 27B on Google's reported benchmarks -- a roughly 7x smaller model beating its predecessor's flagship.
- Google provides official LoRA fine-tuning docs with Keras and HuggingFace PEFT.
- QAT (Quantization-Aware Training) variants available for better quantized performance.
- Native function calling and structured output support.
- Multimodal capability (text + images) could be useful for screenshot-based troubleshooting.
- Unsloth, Axolotl, and LlamaFactory all support Gemma 3.
Caveats:
- License is NOT Apache 2.0. Gemma Terms of Use allow commercial use but include a Prohibited Use Policy covering sensitive domains. Google retains the right to "restrict (remotely or otherwise) usage." This is more restrictive than Apache 2.0/MIT.
- For a personal Minecraft server project this is likely fine, but it fails the strict "permissive license" requirement.
6. Gemma 3 12B-IT
| Attribute | Detail |
|---|---|
| Parameters | 12B (multimodal) |
| Release | March 2025 |
| License | Gemma Terms of Use (same caveats as 4B) |
| HuggingFace | google/gemma-3-12b-it -- safetensors |
| Ollama | ollama pull gemma3:12b |
| Q4 VRAM | ~6.6 GB (Google claims RTX 4060 8GB works) |
| QLoRA VRAM | ~18-20 GB (fits 24GB) |
| Context | 128K |
Why #6:
- One of the two largest models on this list (alongside Mistral NeMo 12B) that still fit in 8GB VRAM at Q4.
- Best raw capability of any model on this list.
- QAT Q4 variants from Google specifically optimized for consumer GPUs.
- Full Unsloth support.
Caveats:
- Tight fit on 8GB -- leaves little headroom for KV cache with long prompts.
- Same license concerns as Gemma 3 4B.
- QLoRA training at 12B needs more VRAM; will use ~18-20 GB of your 24GB budget.
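The KV-cache headroom concern can be estimated with back-of-envelope arithmetic. The sketch below assumes full attention in every layer and an FP16 cache, which makes it an upper bound. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Gemma 3 12B's published configuration; substitute the values from the model's config file before relying on the result.

```python
GIB = 2**30


def kv_cache_bytes(tokens: int,
                   layers: int = 48,         # hypothetical placeholder
                   kv_heads: int = 8,        # hypothetical placeholder
                   head_dim: int = 128,      # hypothetical placeholder
                   bytes_per_value: int = 2  # FP16 cache
                   ) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor of 2 covers one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


for context in (2048, 8192, 32768):
    print(f"{context:>6} tokens: {kv_cache_bytes(context) / GIB:.2f} GiB")
```

Under these placeholder values an 8K-token prompt needs about 1.5 GiB of cache, which on top of ~6.6 GB of weights would already exceed an 8GB card. Architectures that use sliding-window attention in some layers would need less than this estimate.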
7. Mistral NeMo 12B
| Attribute | Detail |
|---|---|
| Parameters | 12B |
| Release | July 2024 |
| License | Apache 2.0 |
| HuggingFace | mistralai/Mistral-Nemo-Instruct-2407 -- safetensors |
| Ollama | ollama pull mistral-nemo:12b |
| Q4 VRAM | ~7 GB |
| QLoRA VRAM | ~18-22 GB (higher due to large vocabulary) |
| Context | 128K |
Why #7:
- Apache 2.0 license, built with NVIDIA collaboration.
- 128K context, strong multilingual support.
- Established fine-tuning ecosystem with mistral-finetune tool.
Caveats:
- Oldest model on this list (July 2024) -- outperformed by newer 4-8B models on many benchmarks.
- Large vocabulary (~131K tokens via the Tekken tokenizer) increases memory requirements for fine-tuning beyond what the parameter count suggests.
- Tight fit on 8GB VRAM at Q4 with limited context headroom.
- Not recommended over Qwen3-8B, which is newer, smaller, and benchmarks better.
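The vocabulary caveat can be illustrated with simple arithmetic: embedding matrices scale with vocabulary size times hidden size, and the logits computed during training scale with sequence length times vocabulary size. This is a sketch under stated assumptions. The hidden size of 5120 is an assumed value, the embeddings are taken to be untied and kept in 16-bit, and logits are taken to be materialized in FP32; none of these figures comes from the evaluation above.

```python
GIB = 2**30


def embedding_gib(vocab_size: int, hidden_size: int,
                  bytes_per_value: int = 2, tied: bool = False) -> float:
    """Memory for the input embedding and output projection, in GiB."""
    matrices = 1 if tied else 2  # untied models store both matrices
    return matrices * vocab_size * hidden_size * bytes_per_value / GIB


def logits_gib(tokens: int, vocab_size: int, bytes_per_value: int = 4) -> float:
    """Memory for one sequence's logits at the given precision, in GiB."""
    return tokens * vocab_size * bytes_per_value / GIB


HIDDEN = 5120  # assumed hidden size, for illustration only
for vocab in (32_768, 131_072):
    print(f"vocab {vocab:>7}: "
          f"embeddings {embedding_gib(vocab, HIDDEN):.3f} GiB, "
          f"logits for 2048 tokens {logits_gib(2048, vocab):.2f} GiB")
```

Under these assumptions, moving from a 32K to a 131K vocabulary quadruples both terms, which is consistent with the higher QLoRA estimate quoted for this model.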
Models Considered and Rejected
| Model | Reason for Rejection |
|---|---|
| Llama 3.2 (1B/3B) | Llama Community License prohibits using outputs to train non-Llama models. Distillation restrictions. Not truly permissive. |
| Llama 3.1-8B / 3.3-70B | Same license restrictions as above. The 700M MAU clause and output training restrictions disqualify it. |
| Qwen3-Coder (30B-A3B, 480B) | All variants are massive MoE models. Even the smallest (30B-A3B with 3B active) has 30B total parameters -- too large for 8GB inference and questionable for 24GB QLoRA. |
| Mistral Small 3 (24B) | 24B parameters -- requires ~14 GB VRAM at Q4. Does not fit 8GB. |
| Phi-4 (14B) | Needs ~8-9 GB at Q4, so it fits 8GB only marginally, if at all. QLoRA at 14B needs ~22-24 GB, cutting it very close to the 24GB budget. The 3.8B Phi-4-mini is a better fit for this project. |
| Gemma 2 (9B/27B) | Superseded by Gemma 3. No reason to use older generation. |
| Qwen2.5 (7B/14B) | Superseded by Qwen3 and Qwen3.5 with significantly better benchmarks. |
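The size-based rejections in this table follow from simple arithmetic on quantized weight size. A minimal sketch, assuming roughly 4.85 bits per weight for a Q4-class GGUF quantization (a commonly quoted figure, treated here as an assumption) and using the nominal parameter counts from the model names. It covers weights only; KV cache and runtime overhead come on top.

```python
def q4_weight_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billion * bits_per_weight / 8


INFERENCE_BUDGET_GB = 8.0

# Nominal parameter counts taken from the model names.
for name, params in [("Mistral Small 3", 24), ("Phi-4", 14), ("Qwen3-8B", 8)]:
    size = q4_weight_gb(params)
    verdict = "fits" if size < INFERENCE_BUDGET_GB else "does not fit"
    print(f"{name}: ~{size:.1f} GB of weights at Q4, {verdict} in 8 GB")
```

With these assumptions the 24B model lands near the ~14 GB quoted above and the 14B model between 8 and 9 GB, while an 8B model leaves a few gigabytes for cache and overhead.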
Fine-Tuning Ecosystem Comparison (as of March 2026)
| Framework | Qwen3/3.5 | Phi-4-mini | Gemma 3 | Mistral NeMo |
|---|---|---|---|---|
| Unsloth | Full support, dedicated notebooks, 2x speedup | Supported, notebooks available | Supported, Gemma 3n confirmed | Supported |
| Axolotl | Supported | Supported | Supported | Supported |
| LlamaFactory | Supported, Ollama export | Supported | Supported | Supported |
| HF PEFT/TRL | Supported | Supported, official script | Supported, Google official docs | Supported |
| Community notebooks | Abundant | Moderate | Abundant | Moderate |
Recommendation for This Project
Primary: Qwen3-8B -- Best balance of capability, VRAM fit, license cleanliness, and fine-tuning ecosystem. It significantly outperforms older 14B models while fitting comfortably in 8GB at Q4 (~5.5 GB). Apache 2.0 imposes no usage restrictions, so it meets the permissive-license constraint outright.
Secondary: Qwen3-4B or Qwen3.5-4B -- If training data is limited (<500 examples) or you want faster iteration cycles, a 4B model will fine-tune faster and still perform well on the narrow domain of Minecraft server operations. Qwen3.5-4B is newer with a 256K context window; Qwen3-4B has more proven fine-tuning results.
Note on qwen3-coder: The current PLAN.md references qwen3-coder as the base model. All Qwen3-Coder variants are large MoE models (30B+ total parameters) that do not fit the 8GB inference constraint. The recommendation is to use Qwen3-8B (or Qwen3-4B) as the base model instead. The coding/command-generation capability can be developed through fine-tuning on domain-specific data rather than requiring a code-specialized base model.
Sources
- Qwen3 announcement and benchmarks
- Qwen3.5 on HuggingFace
- Qwen3.5 on Ollama
- Phi-4-mini-instruct on HuggingFace
- Phi-4-mini on Ollama
- Gemma 3 on Ollama
- Gemma 3 QAT models for consumer GPUs
- Gemma Terms of Use
- Gemma license risk analysis
- Mistral NeMo on HuggingFace
- Mistral NeMo on Ollama
- Unsloth model catalog
- Unsloth Qwen3 fine-tuning guide
- Unsloth Qwen3.5 fine-tuning guide
- Unsloth Phi-4 fine-tuning
- Unsloth Gemma 3 fine-tuning
- Fine-tuning framework comparison 2026
- Distillabs SLM fine-tuning benchmark
- JSONSchemaBench structured output benchmark
- Llama license restrictions analysis
- Qwen3-Coder on HuggingFace
- Top SLMs 2026 overview (DataCamp)
- Best open-source SLMs 2026 (BentoML)