Small LLM Bake-Off: 7 models, 1 GPU, 31 tasks
Tested gemma3n:e4b, qwen3-coder:30b, phi4-mini, qwen3:8b, qwen3.5:9b, qwen3.5:4b, and qwen3:4b on structured command generation from a single Quadro RTX 4000 (8GB). The 6.9B model beat the 30B model on every metric. Includes the test harness, evaluation dataset, raw results from all rounds, and a writeup covering the token budget discovery that doubled one model's score overnight. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,3 @@
|
||||
__pycache__/
|
||||
*.pyc
|
||||
.DS_Store
|
||||
Reference in New Issue
Block a user