docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own
README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts,
  gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev
  HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files,
  tokenizer_config.json, transformers gemma4/ source, launch blog posts,
  official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI
  comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2,
  Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE),
  TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md
(not applied) — notably the new <|turn>/<turn|> prompt format, gemma_pytorch
abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM,
FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech
report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization — flagged "secrets"
are base64 notebook cell outputs and example Ed25519 keys in the HDP
agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Mortdecai
2026-04-18 12:24:48 -04:00
parent 5011059f5d
commit eecebe7ef5
149 changed files with 181297 additions and 0 deletions
@@ -0,0 +1,250 @@
<h1 align="center" style="margin:0;">
<a href="https://unsloth.ai/docs"><picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/unslothai/unsloth/main/images/STUDIO%20WHITE%20LOGO.png">
<source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/unslothai/unsloth/main/images/STUDIO%20BLACK%20LOGO.png">
<img alt="Unsloth logo" src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/STUDIO%20BLACK%20LOGO.png" height="60" style="max-width:100%;">
</picture></a>
</h1>
<h3 align="center" style="margin: 0; margin-top: 0;">
Run and train AI models with a unified local interface.
</h3>
<p align="center">
<a href="#-features">Features</a> •
<a href="#-quickstart">Quickstart</a> •
<a href="#-free-notebooks">Notebooks</a> •
<a href="https://unsloth.ai/docs">Documentation</a> •
<a href="https://www.reddit.com/r/unsloth/">Reddit</a>
</p>
<a href="https://unsloth.ai/docs/new/studio">
<img alt="unsloth studio ui homepage" src="https://raw.githubusercontent.com/unslothai/unsloth/main/studio/frontend/public/studio%20github%20landscape%20colab%20display.png" style="max-width: 100%; margin-bottom: 0;"></a>
Unsloth Studio (Beta) lets you run and train text, [audio](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning), [embedding](https://unsloth.ai/docs/new/embedding-finetuning) and [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) models on Windows, Linux and macOS.
## ⭐ Features
Unsloth provides several key features for both inference and training:
### Inference
* **Search + download + run models** including GGUF, LoRA adapters, safetensors
* **Export models**: [Save or export](https://unsloth.ai/docs/new/studio/export) models to GGUF, 16-bit safetensors and other formats.
* **Tool calling**: Support for [self-healing tool calling](https://unsloth.ai/docs/new/studio/chat#auto-healing-tool-calling) and web search
* **[Code execution](https://unsloth.ai/docs/new/studio/chat#code-execution)**: lets LLMs test code in Claude artifacts and sandbox environments
* [Auto-tune inference parameters](https://unsloth.ai/docs/new/studio/chat#auto-parameter-tuning) and customize chat templates.
* We work directly with the teams behind [gpt-oss](https://docs.unsloth.ai/new/gpt-oss-how-to-run-and-fine-tune#unsloth-fixes-for-gpt-oss), [Qwen3](https://www.reddit.com/r/LocalLLaMA/comments/1kaodxu/qwen3_unsloth_dynamic_ggufs_128k_context_bug_fixes/), [Llama 4](https://github.com/ggml-org/llama.cpp/pull/12889), [Mistral](models/tutorials/devstral-how-to-run-and-fine-tune.md), [Gemma 1-3](https://news.ycombinator.com/item?id=39671146), and [Phi-4](https://unsloth.ai/blog/phi4), where we've fixed bugs to improve model accuracy.
* Upload images, audio, PDFs, code, DOCX and more file types to chat with.
### Training
* Train and RL **500+ models** up to **2x faster** with up to **70% less VRAM**, with no accuracy loss.
* Custom Triton and mathematical **kernels**. See some collabs we did with [PyTorch](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) and [Hugging Face](https://unsloth.ai/docs/new/faster-moe).
* **Data Recipes**: [Auto-create datasets](https://unsloth.ai/docs/new/studio/data-recipe) from **PDF, CSV, DOCX** etc. Edit data in a visual-node workflow.
* **[Reinforcement Learning](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide)** (RL): The most efficient [RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) library, using **80% less VRAM** for GRPO, [FP8](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) etc.
* Supports full fine-tuning, RL, pretraining, 4-bit, 16-bit and FP8 training.
* **Observability**: Monitor training live, track loss and GPU usage and customize graphs.
* [Multi-GPU](https://unsloth.ai/docs/basics/multi-gpu-training-with-unsloth) training is supported, with major improvements coming soon.
## ⚡ Quickstart
Unsloth can be used in two ways: through **[Unsloth Studio](https://unsloth.ai/docs/new/studio/)**, the web UI, or through **Unsloth Core**, the code-based version. Each has different requirements.
### Unsloth Studio (web UI)
Unsloth Studio (Beta) works on **Windows, Linux, WSL** and **macOS**.
* **CPU:** Supported for Chat and Data Recipes currently
* **NVIDIA:** Training works on RTX 30/40/50, Blackwell, DGX Spark, Station and more
* **macOS:** Currently supports chat and Data Recipes. **MLX training** is coming very soon
* **AMD:** Chat + Data works. Train with [Unsloth Core](#unsloth-core-code-based). Studio support is out soon.
* **Coming soon:** Training support for Apple MLX, AMD, and Intel.
* **Multi-GPU:** Available now, with a major upgrade on the way
#### macOS, Linux, WSL:
```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```
#### Windows:
```powershell
irm https://unsloth.ai/install.ps1 | iex
```
#### Launch
```bash
unsloth studio -H 0.0.0.0 -p 8888
```
#### Update
To update, use the same install commands as above. Or run (does not work on Windows):
```bash
unsloth studio update
```
#### Docker
Use our [Docker image](https://hub.docker.com/r/unsloth/unsloth), `unsloth/unsloth`. Run:
```bash
docker run -d -e JUPYTER_PASSWORD="mypassword" \
-p 8888:8888 -p 8000:8000 -p 2222:22 \
-v $(pwd)/work:/workspace/work \
--gpus all \
unsloth/unsloth
```
#### Developer, Nightly, Uninstall
For developer, nightly and uninstall instructions, see [advanced installation](#-advanced-installation).
### Unsloth Core (code-based)
#### Linux, WSL:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv unsloth_env --python 3.13
source unsloth_env/bin/activate
uv pip install unsloth --torch-backend=auto
```
#### Windows:
```powershell
winget install -e --id Python.Python.3.13
winget install --id=astral-sh.uv -e
uv venv unsloth_env --python 3.13
.\unsloth_env\Scripts\activate
uv pip install unsloth --torch-backend=auto
```
For Windows, `pip install unsloth` works only if you have PyTorch installed. Read our [Windows Guide](https://unsloth.ai/docs/get-started/install/windows-installation).
You can use the same Docker image as Unsloth Studio.
#### AMD, Intel:
For RTX 50x, B200, 6000 GPUs: `uv pip install unsloth --torch-backend=auto`. Read our guides for: [Blackwell](https://unsloth.ai/docs/blog/fine-tuning-llms-with-blackwell-rtx-50-series-and-unsloth) and [DGX Spark](https://unsloth.ai/docs/blog/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth). <br>
To install Unsloth on **AMD** and **Intel** GPUs, follow our [AMD Guide](https://unsloth.ai/docs/get-started/install/amd) and [Intel Guide](https://unsloth.ai/docs/get-started/install/intel).
## 📒 Free Notebooks
Train for free with our notebooks. You can use our new [free Unsloth Studio notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb) to run and train models for free in a web UI.
Read our [guide](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide). Add dataset, run, then deploy your trained model.
| Model | Free Notebooks | Performance | Memory use |
|-----------|---------|--------|----------|
| **Gemma 4 (E2B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)-Vision.ipynb) | 1.5x faster | 50% less |
| **Qwen3.5 (4B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_5_(4B)_Vision.ipynb) | 1.5x faster | 60% less |
| **gpt-oss (20B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb) | 2x faster | 70% less |
| **Qwen3.5 GSPO** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_5_(4B)_Vision_GRPO.ipynb) | 2x faster | 70% less |
| **gpt-oss (20B): GRPO** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) | 2x faster | 80% less |
| **Qwen3: Advanced GRPO** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb) | 2x faster | 70% less |
| **embeddinggemma (300M)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/EmbeddingGemma_(300M).ipynb) | 2x faster | 20% less |
| **Mistral Ministral 3 (3B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Ministral_3_VL_(3B)_Vision.ipynb) | 1.5x faster | 60% less |
| **Llama 3.1 (8B) Alpaca** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2x faster | 70% less |
| **Llama 3.2 Conversational** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) | 2x faster | 70% less |
| **Orpheus-TTS (3B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Orpheus_(3B)-TTS.ipynb) | 1.5x faster | 50% less |
- See all our notebooks for: [Kaggle](https://github.com/unslothai/notebooks?tab=readme-ov-file#-kaggle-notebooks), [GRPO](https://unsloth.ai/docs/get-started/unsloth-notebooks#grpo-reasoning-rl-notebooks), [TTS](https://unsloth.ai/docs/get-started/unsloth-notebooks#text-to-speech-tts-notebooks), [embedding](https://unsloth.ai/docs/new/embedding-finetuning) & [Vision](https://unsloth.ai/docs/get-started/unsloth-notebooks#vision-multimodal-notebooks)
- See [all our models](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [all our notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks)
- See detailed documentation for Unsloth [here](https://unsloth.ai/docs)
## 🦥 Unsloth News
- **Gemma 4**: Run and train Google's new models directly in Unsloth Studio! [Blog](https://unsloth.ai/docs/models/gemma-4)
- **Introducing Unsloth Studio**: our new web UI for running and training LLMs. [Blog](https://unsloth.ai/docs/new/studio)
- **Qwen3.5** - 0.8B, 2B, 4B, 9B, 27B, 35-A3B, 112B-A10B are now supported. [Guide + notebooks](https://unsloth.ai/docs/models/qwen3.5/fine-tune)
- Train **MoE LLMs 12x faster** with 35% less VRAM - DeepSeek, GLM, Qwen and gpt-oss. [Blog](https://unsloth.ai/docs/new/faster-moe)
- **Embedding models**: Unsloth now supports ~1.8-3.3x faster embedding fine-tuning. [Blog](https://unsloth.ai/docs/new/embedding-finetuning) • [Notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks#embedding-models)
- New **7x longer context RL** vs. all other setups, via our new batching algorithms. [Blog](https://unsloth.ai/docs/new/grpo-long-context)
- New RoPE & MLP **Triton Kernels** & **Padding Free + Packing**: 3x faster training & 30% less VRAM. [Blog](https://unsloth.ai/docs/new/3x-faster-training-packing)
- **500K Context**: Training a 20B model with >500K context is now possible on an 80GB GPU. [Blog](https://unsloth.ai/docs/blog/500k-context-length-fine-tuning)
- **FP8 & Vision RL**: You can now do FP8 & VLM GRPO on consumer GPUs. [FP8 Blog](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/vision-reinforcement-learning-vlm-rl)
- **gpt-oss** by OpenAI: Read our [RL blog](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/gpt-oss-reinforcement-learning), [Flex Attention](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/long-context-gpt-oss-training) blog and [Guide](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune).
## 📥 Advanced Installation
The below advanced instructions are for Unsloth Studio. For Unsloth Core advanced installation, [view our docs](https://unsloth.ai/docs/get-started/install/pip-install#advanced-pip-installation).
#### Developer installs: macOS, Linux, WSL:
```bash
git clone https://github.com/unslothai/unsloth
cd unsloth
./install.sh --local
unsloth studio -H 0.0.0.0 -p 8888
```
Then to update:
```bash
unsloth studio update
```
#### Developer installs: Windows PowerShell:
```powershell
git clone https://github.com/unslothai/unsloth.git
cd unsloth
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\install.ps1 --local
unsloth studio -H 0.0.0.0 -p 8888
```
Then to update:
```bash
unsloth studio update
```
#### Nightly: macOS, Linux, WSL:
```bash
git clone https://github.com/unslothai/unsloth
cd unsloth
git checkout nightly
./install.sh --local
unsloth studio -H 0.0.0.0 -p 8888
```
Then to launch every time:
```bash
unsloth studio -H 0.0.0.0 -p 8888
```
#### Nightly: Windows:
Run in Windows PowerShell:
```powershell
git clone https://github.com/unslothai/unsloth.git
cd unsloth
git checkout nightly
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\install.ps1 --local
unsloth studio -H 0.0.0.0 -p 8888
```
Then to launch every time:
```bash
unsloth studio -H 0.0.0.0 -p 8888
```
#### Uninstall
You can uninstall Unsloth Studio by deleting its install folder, usually located at `$HOME/.unsloth/studio` on macOS/Linux/WSL and `%USERPROFILE%\.unsloth\studio` on Windows. The commands below will **delete everything**, including your history and cache:
* **macOS, WSL, Linux:** `rm -rf ~/.unsloth/studio`
* **Windows (PowerShell):** `Remove-Item -Recurse -Force "$HOME\.unsloth\studio"`
For more info, [see our docs](https://unsloth.ai/docs/new/studio/install#uninstall).
#### Deleting model files
You can delete old model files either from the bin icon in model search or by removing the relevant cached model folder from the default Hugging Face cache directory. By default, HF uses:
* **macOS, Linux, WSL:** `~/.cache/huggingface/hub/`
* **Windows:** `%USERPROFILE%\.cache\huggingface\hub\`
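Before deleting anything by hand, it can help to see which cached models take the most space. The following is a minimal Python sketch, not part of Unsloth or `huggingface_hub`: `list_cached_models` is a hypothetical helper, and it assumes the default hub directory above and the `models--<org>--<name>` folder naming used by the Hugging Face cache.

```python
from pathlib import Path


def list_cached_models(hub_dir: Path) -> list[tuple[str, int]]:
    """Return (folder_name, size_in_bytes) for each cached model folder, largest first."""
    results = []
    if not hub_dir.is_dir():
        return results
    for entry in hub_dir.iterdir():
        # Cached model repos are stored in folders named "models--<org>--<name>".
        if entry.is_dir() and entry.name.startswith("models--"):
            # Skip symlinks so files linked from snapshots are not counted twice.
            size = sum(
                f.stat().st_size
                for f in entry.rglob("*")
                if f.is_file() and not f.is_symlink()
            )
            results.append((entry.name, size))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    default_hub = Path.home() / ".cache" / "huggingface" / "hub"
    for name, size in list_cached_models(default_hub):
        print(f"{size / 1024**3:8.2f} GiB  {name}")
```

The script only reads the cache; removing a folder is still a manual step.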
## 💚 Community and Links
| Type | Links |
| ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| <img width="16" src="https://cdn.prod.website-files.com/6257adef93867e50d84d30e2/66e3d80db9971f10a9757c99_Symbol.svg" />  **Discord** | [Join Discord server](https://discord.com/invite/unsloth) |
| <img width="15" src="https://redditinc.com/hs-fs/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" />  **r/unsloth Reddit** | [Join Reddit community](https://reddit.com/r/unsloth) |
| 📚 **Documentation & Wiki** | [Read Our Docs](https://unsloth.ai/docs) |
| <img width="13" src="https://upload.wikimedia.org/wikipedia/commons/0/09/X_(formerly_Twitter)_logo_late_2025.svg" />  **Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai) |
| 🔮 **Our Models** | [Unsloth Catalog](https://unsloth.ai/docs/get-started/unsloth-model-catalog) |
| ✍️ **Blog** | [Read our Blogs](https://unsloth.ai/blog) |
### Citation
You can cite the Unsloth repo as follows:
```bibtex
@software{unsloth,
author = {Daniel Han and Michael Han and {Unsloth team}},
title = {Unsloth},
url = {https://github.com/unslothai/unsloth},
year = {2023}
}
```
If you trained a model with 🦥Unsloth, you can use this cool sticker!   <img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/made with unsloth.png" width="200" align="center" />
### License
Unsloth uses a dual-licensing model of Apache 2.0 and AGPL-3.0. The core Unsloth package remains licensed under **[Apache 2.0](https://github.com/unslothai/unsloth?tab=Apache-2.0-1-ov-file)**, while certain optional components, such as the Unsloth Studio UI are licensed under the open-source license **[AGPL-3.0](https://github.com/unslothai/unsloth?tab=AGPL-3.0-2-ov-file)**.
This structure helps support ongoing Unsloth development while keeping the project open source and enabling the broader ecosystem to continue growing.
### Thank You to
- The [llama.cpp library](https://github.com/ggml-org/llama.cpp) that lets users run and save models with Unsloth
- The Hugging Face team and their libraries: [transformers](https://github.com/huggingface/transformers) and [TRL](https://github.com/huggingface/trl)
- The PyTorch and [Torch AO](https://github.com/unslothai/unsloth/pull/3391) team for their contributions
- NVIDIA for their [NeMo DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) library and their contributions
- And of course every single person who has contributed to or used Unsloth!
File diff suppressed because it is too large Load Diff
File diff suppressed because one or more lines are too long
@@ -0,0 +1,512 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# ### News
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
#
# <table><tr>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
# </tr></table>
#
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
#
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
#
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
#
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
# # ### Installation
#
# # In[1]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
#
#
# # In[2]:
#
#
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
#
#
# # ### Unsloth
#
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
# In[3]:
from unsloth import FastModel
import torch
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/gemma-4-26B-A4B-it",
dtype = None, # None for auto detection
max_seq_length = 8192, # Choose any for long context!
load_in_4bit = True, # 4 bit quantization to reduce memory
full_finetuning = False, # [NEW!] We have full finetuning now!
# token = "YOUR_HF_TOKEN", # HF Token for gated models
)
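# A rough sketch of why `load_in_4bit = True` matters for memory. It counts weight storage only and assumes the "26B" in the model name is the total parameter count (an assumption, not stated in this notebook). Real usage is higher because of quantization constants, layers kept in higher precision, activations and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


ASSUMED_TOTAL_PARAMS = 26e9  # assumption: read off the "26B" in the model name

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    gib = weight_memory_gib(ASSUMED_TOTAL_PARAMS, bits)
    print(f"{label:>6}: about {gib:.1f} GiB for weights")
```

# Under these assumptions, 4-bit weights take a quarter of the space of 16-bit weights.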
# # Gemma 4 can process Text, Vision and Audio!
#
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
# In[4]:
from transformers import TextStreamer
# Helper function for inference
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        use_cache = True,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )
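# To make the `temperature`, `top_k` and `top_p` settings less abstract, here is a minimal pure-Python sketch of how the three filters typically combine when picking the next token. It is an illustration under simplifying assumptions, not the code `model.generate` runs: `filter_probs`, `sample_token` and the toy logits are hypothetical.

```python
import math
import random


def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / top_k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_token(logits, rng, **settings):
    """Draw one token index from the filtered distribution."""
    dist = filter_probs(logits, **settings)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical 4-token vocabulary
print(filter_probs(toy_logits))
print(sample_token(toy_logits, random.Random(3407)))
```

# With only four toy tokens, `top_k = 64` keeps everything and `top_p = 0.95` drops the least likely token.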
# # Gemma 4 can see images!
#
# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
# In[5]:
sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
messages = [{
"role" : "user",
"content": [
{ "type": "image", "image" : sloth_link },
{ "type": "text", "text" : "Which films does this animal feature in?" }
]
}]
# You might have to wait 1 minute for Unsloth's auto compiler
do_gemma_4_inference(messages, max_new_tokens = 256)
# Let's make a poem about sloths!
# In[6]:
messages = [{
"role": "user",
"content": [{ "type" : "text",
"text" : "Write a poem about sloths." }]
}]
do_gemma_4_inference(messages)
# # Let's finetune Gemma 4!
#
# You can finetune the vision and text parts for now through selection - the audio part can also be finetuned - we're working to make it selectable as well!
# We now add LoRA adapters so we only need to update a small amount of parameters!
# In[7]:
model = FastModel.get_peft_model(
model,
finetune_vision_layers = False, # Turn off for just text!
finetune_language_layers = True, # Should leave on!
finetune_attention_modules = True, # Attention good for GRPO
finetune_mlp_modules = True, # Should leave on always!
r = 8, # Larger = higher accuracy, but might overfit
lora_alpha = 8, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
)
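# To see why LoRA only updates a small number of parameters, here is a short arithmetic sketch. For one weight matrix of shape `d_out x d_in`, a rank-`r` adapter adds two small matrices with `r * (d_in + d_out)` parameters, and the update is scaled by `lora_alpha / r`. The 4096 x 4096 layer size is a hypothetical example, not a Gemma 4 dimension.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


d_in = d_out = 4096      # hypothetical layer size, for illustration only
r, lora_alpha = 8, 8     # the values used in the cell above

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full weight:  {full:,} params")
print(f"LoRA adapter: {added:,} params ({added / full:.2%} of the full weight)")
print(f"update scaling alpha / r = {lora_scaling(lora_alpha, r)}")
```

# Doubling `r` doubles the adapter size; keeping `lora_alpha == r` keeps the scaling at 1.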
# <a name="Data"></a>
# ### Data Prep
# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
#
# ```
# <bos><|turn>user
# Hello<turn|>
# <|turn>model
# Hey there!<turn|>
# ```
# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
# In[8]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "gemma-4-thinking",
)
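# The turn format shown in the Data Prep notes can be reproduced with a few lines of plain Python. This is only a sketch of the visible layout: `render_gemma4_turns` is a hypothetical helper, the mapping of the `assistant` role to `model` is an assumption, and the real chat template handles more cases than this.

```python
def render_gemma4_turns(messages, add_generation_prompt=False):
    """Render plain-text messages in the <|turn>role ... <turn|> layout."""
    role_names = {"user": "user", "assistant": "model"}  # assumed role mapping
    text = "<bos>"
    for message in messages:
        role = role_names[message["role"]]
        text += f"<|turn>{role}\n{message['content']}<turn|>\n"
    if add_generation_prompt:
        # Leave an open model turn for the model to complete.
        text += "<|turn>model\n"
    return text


conversation = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma4_turns(conversation))
```

# For real training and inference, rely on `tokenizer.apply_chat_template` rather than a hand-written renderer.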
# We get the first 3000 rows of the dataset
# In[9]:
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
# In[10]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)
# Let's see what row 100 looks like!
# In[11]:
dataset[100]
# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The Processor will add this token before training and the model expects only one.
# In[12]:
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
    return { "text" : texts, }
dataset = dataset.map(formatting_prompts_func, batched = True)
# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
# In[13]:
dataset[100]["text"]
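# A tiny sketch of why the `<bos>` prefix is stripped before training. `fake_processor_encode` is a hypothetical stand-in for a processor that always prepends one `<bos>`; it is not the real tokenizer.

```python
BOS = "<bos>"


def fake_processor_encode(text: str) -> str:
    """Stand-in for a processor that always prepends one BOS token."""
    return BOS + text


rendered = "<bos><|turn>user\nHello<turn|>\n"   # output of the chat template
stripped = rendered.removeprefix(BOS)           # what gets stored in `text`

print(fake_processor_encode(rendered).count(BOS))  # BOS count without stripping
print(fake_processor_encode(stripped).count(BOS))  # BOS count after stripping
```

# `str.removeprefix` needs Python 3.9 or newer and leaves the string unchanged when the prefix is absent.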
# <a name="Train"></a>
# ### Train the model
# Now let's train our model. We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and set `max_steps=None`.
# In[14]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
eval_dataset = None, # Can set up evaluation!
args = SFTConfig(
dataset_text_field = "text",
per_device_train_batch_size = 1,
gradient_accumulation_steps = 4, # Use GA to mimic batch size!
warmup_steps = 5,
# num_train_epochs = 1, # Set this for 1 full training run.
max_steps = 60,
learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.001,
lr_scheduler_type = "linear",
seed = 3407,
report_to = "none", # Use TrackIO/WandB etc
),
)
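# A short sketch of what the numbers in this config imply. It assumes a single GPU and models the `linear` scheduler as linear warmup followed by linear decay to zero, which is the usual shape; `linear_schedule_lr` is a hypothetical helper, not the trainer's own code.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 60
num_devices = 1  # assumption: one GPU

# Gradient accumulation sums gradients over several small batches per update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch * max_steps

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen} of 3000 loaded rows")
for step in (0, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

# So the 60-step demo run covers only part of the 3000 rows, which is why a full epoch is suggested for real training.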
# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
# In[15]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
trainer,
instruction_part = "<|turn>user\n",
response_part = "<|turn>model\n",
)
# Let's verify masking the instruction part is done! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
# In[16]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])
# Now let's print the masked out example - you should see only the answer is present:
# In[17]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
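# The masking idea can be sketched on plain lists of token ids. This is a simplified illustration, not Unsloth's implementation: `mask_non_responses` and the toy ids are hypothetical. Every position outside a model response gets the label `-100`, which the loss ignores.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def find_all(seq, pattern):
    """Start indices of every occurrence of `pattern` inside `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Copy input_ids into labels only inside response spans; mask the rest."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # A response runs until the next instruction marker or the end.
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy ids: [1, 2] plays the user-turn marker, [1, 3] the model-turn marker.
toy_ids = [0, 1, 2, 10, 11, 9, 1, 3, 20, 21, 9, 1, 2, 12, 9, 1, 3, 22, 9]
print(mask_non_responses(toy_ids, [1, 2], [1, 3]))
```

# Only the ids that follow a model-turn marker survive in the labels, which matches what the decoded example above should show.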
# In[18]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# # Let's train the model!
#
# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
# In[19]:
trainer_stats = trainer.train()
# In[20]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
# <a name="Inference"></a>
# ### Inference
# Let's run the model via Unsloth native inference! As noted earlier, Gemma 4's recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
# In[21]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "gemma-4-thinking",
)
messages = [{
"role": "user",
"content": [{
"type" : "text",
"text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens = 64, # Increase for longer outputs!
use_cache = True,
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)
# You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
# In[22]:
messages = [{
"role": "user",
"content": [{"type" : "text", "text" : "Why is the sky blue?",}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
from transformers import TextStreamer
_ = model.generate(
**inputs,
max_new_tokens = 64, # Increase for longer outputs!
use_cache = True,
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
streamer = TextStreamer(tokenizer, skip_prompt = True),
)
# <a name="Save"></a>
# ### Saving, loading finetuned models
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
#
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
# In[23]:
model.save_pretrained("gemma_4_lora") # Local saving
tokenizer.save_pretrained("gemma_4_lora")
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
# In[24]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )
messages = [{
"role": "user",
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
from transformers import TextStreamer
_ = model.generate(
**inputs,
max_new_tokens = 128, # Increase for longer outputs!
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
streamer = TextStreamer(tokenizer, skip_prompt = True),
)
# ### Saving to float16 for VLLM
#
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
# In[25]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[26]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
        token = "YOUR_HF_TOKEN"
    )
# ### GGUF / llama.cpp Conversion
# We now support saving to `GGUF` / `llama.cpp` natively for all models! For now, you can convert to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4bit will come later!
# In[27]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[28]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
# Now, load the exported `.gguf` file from the `gemma_4_finetune` folder in llama.cpp. Only `Q8_0`, `BF16` and `F16` exports are available for now, so there is no `Q4_K_M` file yet.
#
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# ### News
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
#
# <table><tr>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
# </tr></table>
#
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
#
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
#
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
#
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
# # ### Installation
#
# # In[1]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
#
#
# # In[2]:
#
#
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
#
#
# # ### Unsloth
# In[3]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, processor = FastVisionModel.from_pretrained(
"unsloth/gemma-4-26B-A4B-it",
load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)
# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
#
# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
# In[4]:
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers = True, # False if not finetuning vision layers
finetune_language_layers = True, # False if not finetuning language layers
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules = True, # False if not finetuning MLP layers
r = 32, # The larger, the higher the accuracy, but might overfit
lora_alpha = 32, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
target_modules = "all-linear", # Optional now! Can specify a list if needed
)
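# To see why LoRA touches only a small fraction of the weights, it helps to count parameters for one linear layer. The sketch below assumes the standard LoRA formulation, where a frozen `d_out x d_in` weight gets two trainable factors of shapes `r x d_in` and `d_out x r`. The layer size of 4096 is a made-up example, not a Gemma 4 dimension, and both helper names are hypothetical.

```python
import math


def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r, use_rslora=False):
    """Factor applied to the low-rank update: alpha/r, or alpha/sqrt(r) for rsLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


d_in = d_out = 4096  # hypothetical layer size, for illustration only
added = lora_param_count(d_in, d_out, r=32)
fraction = added / (d_in * d_out)
print(f"{added} trainable params, {fraction:.2%} of the frozen layer")
print("scaling:", lora_scaling(lora_alpha=32, r=32))
```

# With `r = 32` and `lora_alpha = 32` as in the cell above, the scaling factor is 1.0. Raising `r` grows the adapter linearly while the frozen weight stays the same size.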
# <a name="Data"></a>
# ### Data Prep
# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
#
# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
# In[5]:
from datasets import load_dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
# Let's take an overview of the dataset. We'll examine the third image (index 2) and its corresponding caption.
# In[6]:
dataset
# In[7]:
dataset[2]["image"]
# In[8]:
dataset[2]["text"]
# We can also render LaTeX directly in the browser!
# In[9]:
from IPython.display import display, Math, Latex
latex = dataset[3]["text"]
display(Math(latex))
# To format the dataset, all vision fine-tuning tasks should follow this format:
#
# ```python
# [
# {
# "role": "user",
# "content": [
# {"type": "text", "text": instruction},
# {"type": "image", "image": sample["image"]},
# ],
# },
# {
# "role": "assistant",
# "content": [
# {"type": "text", "text": sample["text"]},
# ],
# },
# ]
# ```
# In[10]:
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]
    return {"messages": conversation}
# Let's convert the dataset into the "correct" format for finetuning:
# In[11]:
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
# The first example is now structured like below:
# In[12]:
converted_dataset[0]
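# Before training on a converted dataset it can be worth checking that every example really has the user/assistant shape that `convert_to_conversation` produces. The sketch below is one possible sanity check. `to_conversation` repeats the conversion so the block runs on its own, `check_conversation` is a hypothetical helper, and a plain `object()` stands in for the PIL image.

```python
instruction = "Write the LaTeX representation for this image."


def to_conversation(sample):
    """Same shape as convert_to_conversation: one user turn, one assistant turn."""
    return {"messages": [
        {"role": "user", "content": [
            {"type": "text", "text": instruction},
            {"type": "image", "image": sample["image"]},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]}


def check_conversation(example):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    messages = example.get("messages", [])
    roles = [m["role"] for m in messages]
    if roles != ["user", "assistant"]:
        problems.append(f"unexpected roles: {roles}")
        return problems
    user_types = {part["type"] for part in messages[0]["content"]}
    if user_types != {"text", "image"}:
        problems.append(f"user turn should hold text and image, got {user_types}")
    answer_types = [part["type"] for part in messages[1]["content"]]
    if answer_types != ["text"]:
        problems.append(f"assistant turn should be text only, got {answer_types}")
    return problems


fake_sample = {"image": object(), "text": r"\frac{a}{b}"}  # stand-in for a dataset row
good = to_conversation(fake_sample)
bad = {"messages": [good["messages"][0], good["messages"][0]]}  # two user turns
print(check_conversation(good), check_conversation(bad))
```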
# Let's take the Gemma 4 instruction chat template and use it in our base model.
# In[13]:
from unsloth import get_chat_template
processor = get_chat_template(
processor,
"gemma-4-thinking"
)
# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
# In[14]:
image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."
messages = [
{
"role": "user",
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# You can see it's absolutely terrible! It doesn't follow instructions at all
# <a name="Train"></a>
# ### Train the model
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the step cap with `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
#
# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
# In[15]:
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
train_dataset = converted_dataset,
processing_class = processor.tokenizer,
data_collator = UnslothVisionDataCollator(model, processor),
args = SFTConfig(
per_device_train_batch_size = 1,
gradient_accumulation_steps = 4,
max_grad_norm = 0.3,
warmup_ratio = 0.03,
max_steps = 60,
# num_train_epochs = 2, # Set this instead of max_steps for full training runs
learning_rate = 2e-4,
logging_steps = 1,
save_strategy = "steps",
optim = "adamw_8bit",
weight_decay = 0.001,
lr_scheduler_type = "cosine",
seed = 3407,
output_dir = "outputs",
report_to = "none", # For Weights and Biases or others
# You MUST put the below items for vision finetuning:
remove_unused_columns = False,
dataset_text_field = "",
dataset_kwargs = {"skip_prepare_dataset": True},
max_length = 2048,
)
)
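# The batch settings in the config above decide how much data a 60-step run actually sees. A small back-of-the-envelope sketch follows. The helper `run_budget` is hypothetical, a single GPU is assumed, and rounding the warmup length up with `math.ceil` is an assumption about how the trainer turns `warmup_ratio` into steps.

```python
import math


def run_budget(per_device_batch, grad_accum, max_steps, warmup_ratio, num_devices=1):
    """Effective batch size, samples consumed, and warmup length for a run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return {
        "effective_batch": effective_batch,
        "samples_seen": effective_batch * max_steps,
        "warmup_steps": math.ceil(max_steps * warmup_ratio),
    }


# Values copied from the SFTConfig above.
budget = run_budget(per_device_batch=1, grad_accum=4, max_steps=60, warmup_ratio=0.03)
print(budget)
```

# Gradient accumulation gives the optimizer the same number of samples per update as a larger batch would, at the memory cost of a batch of one.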
# In[16]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# In[17]:
trainer_stats = trainer.train()
# In[18]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
# <a name="Inference"></a>
# ### Inference
# Let's run the model! You can modify the instruction and input—just leave the output blank.
#
# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
# In[19]:
image = dataset[10]["image"]
instruction = "Write the LaTeX representation for this image."
messages = [
{
"role": "user",
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# <a name="Save"></a>
# ### Saving, loading finetuned models
# To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for online saving, or `save_pretrained` for local storage.
#
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
# In[20]:
model.save_pretrained("gemma_4_lora") # Local saving
processor.save_pretrained("gemma_4_lora")
# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
# In[21]:
if False:
    from unsloth import FastVisionModel
    model, processor = FastVisionModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = True, # Set to False for 16bit LoRA
    )
sample = dataset[1]
image = sample["image"].convert("RGB")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": sample["text"],
},
{
"type": "image",
},
],
},
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# ### Saving to float16 for VLLM
#
# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
# In[22]:
# Select ONLY 1 to save! (Both not needed!)
# Save locally to 16bit
if False: model.save_pretrained_merged("unsloth_finetune", processor,)
# To export and save to your Hugging Face account
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,513 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# # ### Installation
#
# # In[1]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
#
#
# # In[2]:
#
#
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
#
#
# # ### Unsloth
#
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
# In[3]:
from unsloth import FastModel
import torch
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/gemma-4-31B-it",
dtype = None, # None for auto detection
max_seq_length = 8192, # Choose any for long context!
load_in_4bit = True, # 4 bit quantization to reduce memory
full_finetuning = False, # [NEW!] We have full finetuning now!
# token = "YOUR_HF_TOKEN", # HF Token for gated models
)
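# A rough way to see what `load_in_4bit = True` buys is to estimate the memory taken by the weights alone. The sketch below assumes that the "31B" in the model name means about 31 billion parameters. It ignores quantization constants, any layers kept in higher precision, activations and optimizer state, so treat the numbers as lower bounds. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Memory for the weights alone, in GiB, at a given storage precision."""
    return num_params * bits_per_weight / 8 / 1024 ** 3


params = 31e9  # assumption: "31B" read as 31 billion parameters
four_bit = weight_memory_gib(params, 4)
sixteen_bit = weight_memory_gib(params, 16)
print(f"4-bit: ~{four_bit:.1f} GiB, 16-bit: ~{sixteen_bit:.1f} GiB")
```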
# # Gemma 4 can process Text, Vision and Audio!
#
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
# In[4]:
from transformers import TextStreamer
# Helper function for inference
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        use_cache = True,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )
# # Gemma 4 can see images!
#
# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
# In[5]:
sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
messages = [{
"role" : "user",
"content": [
{ "type": "image", "image" : sloth_link },
{ "type": "text", "text" : "Which films does this animal feature in?" }
]
}]
# You might have to wait 1 minute for Unsloth's auto compiler
do_gemma_4_inference(messages, max_new_tokens = 256)
# Let's make a poem about sloths!
# In[6]:
messages = [{
"role": "user",
"content": [{ "type" : "text",
"text" : "Write a poem about sloths." }]
}]
do_gemma_4_inference(messages)
# # Let's finetune Gemma 4!
#
# You can finetune the vision and text parts for now through selection - the audio part can also be finetuned - we're working to make it selectable as well!
# We now add LoRA adapters so we only need to update a small amount of parameters!
# In[7]:
model = FastModel.get_peft_model(
model,
finetune_vision_layers = False, # Turn off for just text!
finetune_language_layers = True, # Should leave on!
finetune_attention_modules = True, # Attention good for GRPO
finetune_mlp_modules = True, # Should leave on always!
r = 8, # Larger = higher accuracy, but might overfit
lora_alpha = 8, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
)
# <a name="Data"></a>
# ### Data Prep
# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
#
# ```
# <bos><|turn>user
# Hello<turn|>
# <|turn>model
# Hey there!<turn|>
# ```
# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
# In[8]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "gemma-4-thinking",
)
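# The turn format shown above can be imitated with a few lines of string handling, which makes it easier to see where each marker lands. This is only a sketch: the real template is applied by `tokenizer.apply_chat_template` and also covers cases this toy ignores (system prompts, thinking, images). The function name `render_gemma4_sketch`, the trailing newline after `<turn|>`, and the mapping of the `assistant` role to `model` are assumptions based on the example rendering above.

```python
def render_gemma4_sketch(messages, add_generation_prompt=False):
    """Render plain-text turns in the <|turn>role ... <turn|> layout."""
    role_names = {"user": "user", "assistant": "model"}
    text = "<bos>"
    for message in messages:
        role = role_names[message["role"]]
        text += f"<|turn>{role}\n{message['content']}<turn|>\n"
    if add_generation_prompt:
        # Open a model turn so that generation continues from here.
        text += "<|turn>model\n"
    return text


chat = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]
rendered = render_gemma4_sketch(chat)
prompt = render_gemma4_sketch(chat[:1], add_generation_prompt=True)
print(rendered)
```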
# We get the first 3000 rows of the dataset
# In[9]:
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
# In[10]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)
# Let's see what row 100 looks like!
# In[11]:
dataset[100]
# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The Processor will add this token before training and the model expects only one.
# In[12]:
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
    return { "text" : texts, }
dataset = dataset.map(formatting_prompts_func, batched = True)
# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
# In[13]:
dataset[100]["text"]
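# The reason for stripping `<bos>` is easiest to see with a toy stand-in for the tokenizer. In the sketch below, `add_bos` is a hypothetical function that prepends `<bos>` the way the processor is described as doing. `str.removeprefix` needs Python 3.9 or newer.

```python
def add_bos(text):
    """Stand-in for a tokenizer step that always prepends <bos>."""
    return "<bos>" + text


templated = "<bos><|turn>user\nHello<turn|>\n"
stripped = templated.removeprefix("<bos>")

doubled = add_bos(templated)  # template output fed in unchanged
single = add_bos(stripped)    # template output with the prefix removed first

print(doubled.count("<bos>"), single.count("<bos>"))
# removeprefix is a no-op when the prefix is absent, so it is safe to call twice.
print(stripped.removeprefix("<bos>") == stripped)
```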
# <a name="Train"></a>
# ### Train the model
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the step cap with `max_steps=None`.
# In[14]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
eval_dataset = None, # Can set up evaluation!
args = SFTConfig(
dataset_text_field = "text",
per_device_train_batch_size = 1,
gradient_accumulation_steps = 4, # Use GA to mimic batch size!
warmup_steps = 5,
# num_train_epochs = 1, # Set this for 1 full training run.
max_steps = 60,
learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.001,
lr_scheduler_type = "linear",
seed = 3407,
report_to = "none", # Use TrackIO/WandB etc
),
)
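# With `lr_scheduler_type = "linear"` and `warmup_steps = 5`, the learning rate ramps up for five steps and then decays towards zero by step 60. The sketch below reproduces that shape in plain Python under the usual linear-warmup, linear-decay formula. `linear_lr` is a hypothetical helper, and the exact step at which the trainer evaluates the schedule may be offset by one.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return base_lr * remaining


for step in (0, 1, 5, 30, 60):
    print(step, f"{linear_lr(step):.2e}")
```

# The peak of `2e-4` is only reached at the end of warmup, so a 60-step run spends most of its time below the nominal learning rate.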
# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
# In[15]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
trainer,
instruction_part = "<|turn>user\n",
response_part = "<|turn>model\n",
)
# Let's verify masking the instruction part is done! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
# In[16]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])
# Now let's print the masked out example - you should see only the answer is present:
# In[17]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
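# The masking idea can be sketched without a tokenizer: every position outside a model turn gets the label `-100`, which the loss ignores, and only the response tokens keep a label. The toy below works on string tokens for readability, whereas the real method works on token ids. `mask_non_responses` is a hypothetical helper, not Unsloth's implementation, and keeping the closing `<turn|>` in the labels is an assumption of this sketch.

```python
IGNORE = -100  # label value that the loss skips


def mask_non_responses(tokens, instruction_part, response_part):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE] * len(tokens)
    i, in_response = 0, False
    while i < len(tokens):
        if tokens[i:i + len(response_part)] == response_part:
            in_response = True  # the marker itself stays masked
            i += len(response_part)
            continue
        if tokens[i:i + len(instruction_part)] == instruction_part:
            in_response = False
            i += len(instruction_part)
            continue
        if in_response:
            labels[i] = tokens[i]
        i += 1
    return labels


tokens = ["<bos>", "<|turn>", "user", "\n", "Hello", "<turn|>", "\n",
          "<|turn>", "model", "\n", "Hey", "there!", "<turn|>", "\n"]
labels = mask_non_responses(
    tokens,
    instruction_part=["<|turn>", "user", "\n"],
    response_part=["<|turn>", "model", "\n"],
)
kept = [label for label in labels if label != IGNORE]
print(kept)
```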
# In[18]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# # Let's train the model!
#
# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`
# In[19]:
trainer_stats = trainer.train()
# In[20]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
# <a name="Inference"></a>
# ### Inference
# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
# In[21]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "gemma-4-thinking",
)
messages = [{
"role": "user",
"content": [{
"type" : "text",
"text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens = 64, # Increase for longer outputs!
use_cache = True,
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)
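# For intuition about what `temperature`, `top_k` and `top_p` do to the next-token distribution, here is a small standard-library sketch. It is an illustration under simplifying assumptions (temperature first, then top-k, then nucleus filtering on the renormalized distribution), not the code path `model.generate` runs, and the function name `filter_top_k_top_p` is made up for this example.

```python
import math


def filter_top_k_top_p(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep the k most likely tokens, then renormalize.
    ranked = ranked[:top_k]
    mass = sum(p for p, _ in ranked)
    ranked = [(p / mass, i) for p, i in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    mass = sum(p for p, _ in kept)
    return {i: p / mass for p, i in kept}


# Toy logits for a four-token vocabulary.
print(filter_top_k_top_p([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9))
```

# Sampling then draws from the surviving tokens only, which is why lowering `top_p` or `top_k` makes outputs more conservative.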
# You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
# In[22]:
messages = [{
"role": "user",
"content": [{"type" : "text", "text" : "Why is the sky blue?",}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
from transformers import TextStreamer
_ = model.generate(
**inputs,
max_new_tokens = 64, # Increase for longer outputs!
use_cache = True,
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
streamer = TextStreamer(tokenizer, skip_prompt = True),
)
# <a name="Save"></a>
# ### Saving, loading finetuned models
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
#
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
# In[23]:
model.save_pretrained("gemma_4_lora") # Local saving
tokenizer.save_pretrained("gemma_4_lora")
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
# In[24]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )
messages = [{
"role": "user",
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
from transformers import TextStreamer
_ = model.generate(
**inputs,
max_new_tokens = 128, # Increase for longer outputs!
use_cache = True,
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
streamer = TextStreamer(tokenizer, skip_prompt = True),
)
# ### Saving to float16 for vLLM
#
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
# In[25]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[26]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
        token = "YOUR_HF_TOKEN"
    )
# ### GGUF / llama.cpp Conversion
# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
# In[27]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
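# To get a feel for how the three supported precisions compare on disk, the sketch below estimates file size from a parameter count. It is a back-of-the-envelope estimate only: it assumes every weight is stored at the chosen precision and ignores metadata and any tensors kept at higher precision. The `Q8_0` figure of 8.5 bits per weight comes from its block layout (32 int8 values plus one 16-bit scale). The helper name and the 4-billion-parameter example are hypothetical.

```python
# Approximate storage cost per weight for the export types mentioned above.
BITS_PER_WEIGHT = {"F16": 16.0, "BF16": 16.0, "Q8_0": 8.5}


def estimate_gguf_gib(num_params, quantization_method):
    """Rough GGUF size in GiB, ignoring metadata and mixed-precision tensors."""
    bits = BITS_PER_WEIGHT[quantization_method.upper()]
    return num_params * bits / 8 / 1024**3


# Hypothetical parameter count, chosen only to show the relative sizes.
for method in ("Q8_0", "F16"):
    print(f"{method}: ~{estimate_gguf_gib(4_000_000_000, method):.1f} GiB")
```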
# Likewise, if you instead want to push the GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[28]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
# Now, use the `.gguf` file produced by the export above (for example, the `Q8_0` one) in llama.cpp.
#
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# ### News
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
#
# <table><tr>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
# </tr></table>
#
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
#
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
#
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
#
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
# # ### Installation
#
# # In[1]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
#
#
# # In[2]:
#
#
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
#
#
# # ### Unsloth
# In[3]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, processor = FastVisionModel.from_pretrained(
"unsloth/gemma-4-31B-it",
load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)
# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
#
# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
# In[4]:
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers = True, # False if not finetuning vision layers
finetune_language_layers = True, # False if not finetuning language layers
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules = True, # False if not finetuning MLP layers
r = 32, # The larger, the higher the accuracy, but might overfit
lora_alpha = 32, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
target_modules = "all-linear", # Optional now! Can specify a list if needed
)
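# To see why LoRA trains only a small fraction of the weights, the sketch below counts the adapter parameters for one linear layer and computes the scaling factor applied to the adapter output. The layer shape (4096 x 4096) is a hypothetical example, not a Gemma 4 dimension, and the helper names are made up for illustration. The scaling rule (`lora_alpha / r`, or `lora_alpha / sqrt(r)` with rank-stabilized LoRA) follows the usual LoRA convention.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapted linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r, use_rslora=False):
    """Multiplier applied to the low-rank update before it is added to the frozen weight."""
    return lora_alpha / (r ** 0.5 if use_rslora else r)


d_in, d_out, r, lora_alpha = 4096, 4096, 32, 32  # hypothetical layer shape; r and alpha as above
full = d_in * d_out
adapter = lora_param_count(d_in, d_out, r)
print(f"Adapter params = {adapter} ({100 * adapter / full:.2f}% of the full layer)")
print(f"Scaling = {lora_scaling(lora_alpha, r)}")
```

# Raising `r` grows the adapter linearly, which is why larger ranks cost more memory and can overfit on small datasets.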
# <a name="Data"></a>
# ### Data Prep
# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
#
# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
# In[5]:
from datasets import load_dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
# Let's take an overview of the dataset. We'll examine the third image (index 2) and its corresponding caption.
# In[6]:
dataset
# In[7]:
dataset[2]["image"]
# In[8]:
dataset[2]["text"]
# We can also render LaTeX directly in the browser!
# In[9]:
from IPython.display import display, Math, Latex
latex = dataset[2]["text"] # Same sample as the image and caption shown above
display(Math(latex))
# To format the dataset, all vision fine-tuning tasks should follow this format:
#
# ```python
# [
# {
# "role": "user",
# "content": [
# {"type": "text", "text": instruction},
# {"type": "image", "image": sample["image"]},
# ],
# },
# {
# "role": "assistant",
# "content": [
# {"type": "text", "text": sample["text"]},
# ],
# },
# ]
# ```
# In[10]:
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]
    return {"messages": conversation}
# Let's convert the dataset into the "correct" format for finetuning:
# In[11]:
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
# The first example is now structured like below:
# In[12]:
converted_dataset[0]
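# Before training it can help to confirm that every converted example really has the layout described above: a `messages` list, known roles, content items whose `type` names a field that is present, and an assistant turn at the end. The checker below is a minimal sketch of that idea. The function name `check_conversation` and the exact rules are assumptions for illustration and are not part of Unsloth or TRL.

```python
def check_conversation(example):
    """Raise ValueError if `example` does not follow the training messages layout."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("expected a non-empty 'messages' list")
    for message in messages:
        if message.get("role") not in {"system", "user", "assistant"}:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        for item in message.get("content", []):
            kind = item.get("type")
            if kind not in item:
                raise ValueError(f"content item of type {kind!r} has no {kind!r} field")
    if messages[-1]["role"] != "assistant":
        raise ValueError("the last message should be the assistant's answer")
    return True


# Toy example; object() stands in for a PIL image.
toy_example = {
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Write the LaTeX."},
                                     {"type": "image", "image": object()}]},
        {"role": "assistant", "content": [{"type": "text", "text": "x^2"}]},
    ]
}
print(check_conversation(toy_example))
```

# Note that inference prompts legitimately use `{"type": "image"}` without an attached image, so this check is meant for training examples only.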
# Let's take the Gemma 4 instruction chat template and use it in our base model
# In[13]:
from unsloth import get_chat_template
processor = get_chat_template(
processor,
"gemma-4-thinking"
)
# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
# In[14]:
image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."
messages = [
{
"role": "user",
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# You can see it's absolutely terrible! It doesn't follow instructions at all
# <a name="Train"></a>
# ### Train the model
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
#
# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
# In[15]:
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
train_dataset = converted_dataset,
processing_class = processor.tokenizer,
data_collator = UnslothVisionDataCollator(model, processor),
args = SFTConfig(
per_device_train_batch_size = 1,
gradient_accumulation_steps = 4,
max_grad_norm = 0.3,
warmup_ratio = 0.03,
max_steps = 60,
# num_train_epochs = 2, # Set this instead of max_steps for full training runs
learning_rate = 2e-4,
logging_steps = 1,
save_strategy = "steps",
optim = "adamw_8bit",
weight_decay = 0.001,
lr_scheduler_type = "cosine",
seed = 3407,
output_dir = "outputs",
report_to = "none", # For Weights and Biases or others
# You MUST put the below items for vision finetuning:
remove_unused_columns = False,
dataset_text_field = "",
dataset_kwargs = {"skip_prepare_dataset": True},
max_length = 2048,
)
)
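# The combination of `warmup_ratio = 0.03`, `max_steps = 60` and `lr_scheduler_type = "cosine"` above is easier to picture with numbers. The sketch below reproduces the usual shape of a linear-warmup-then-cosine schedule using only the standard library. It is an approximation written for illustration, assuming warmup steps are rounded up from `max_steps * warmup_ratio`; the trainer's own scheduler is the source of truth, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(step, max_steps=60, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1, 2, 30, 60):
    print(f"step {step:2d}: lr = {cosine_lr(step):.2e}")
```

# With only 60 steps the warmup lasts about two steps, so almost the whole run is spent on the decaying part of the curve.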
# In[16]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# In[17]:
trainer_stats = trainer.train()
# In[18]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
# <a name="Inference"></a>
# ### Inference
# Let's run the model! You can modify the instruction and input—just leave the output blank.
#
# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
# In[19]:
image = dataset[10]["image"]
instruction = "Write the LaTeX representation for this image."
messages = [
{
"role": "user",
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# <a name="Save"></a>
# ### Saving, loading finetuned models
# To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for online saving, or `save_pretrained` for local storage.
#
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
# In[20]:
model.save_pretrained("gemma_4_lora") # Local saving
processor.save_pretrained("gemma_4_lora")
# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
# In[21]:
if False:
    from unsloth import FastVisionModel
    model, processor = FastVisionModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = True, # Set to False for 16bit LoRA
    )
sample = dataset[1]
image = sample["image"].convert("RGB")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": instruction, # Prompt with the instruction, not the ground-truth LaTeX
},
{
"type": "image",
},
],
},
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# ### Saving to float16 for vLLM
#
# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
# In[22]:
# Select ONLY 1 to save! (Both not needed!)
# Save locally to 16bit
if False: model.save_pretrained_merged("unsloth_finetune", processor,)
# To export and save to your Hugging Face account
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,478 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# ### News
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
#
# <table><tr>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
# </tr></table>
#
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
#
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
#
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
#
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
# # ### Installation
#
# # In[1]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
#
#
# # In[2]:
#
#
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
#
#
# # ### Unsloth
#
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
# In[3]:
from unsloth import FastModel
import torch
from huggingface_hub import snapshot_download
fourbit_models = [
# Gemma 4 models
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B-it",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, processor = FastModel.from_pretrained(
model_name = "unsloth/gemma-4-E2B-it",
dtype = None, # None for auto detection
max_seq_length = 8192, # Choose any for long context!
load_in_4bit = False, # 4 bit quantization to reduce memory
full_finetuning = False, # [NEW!] We have full finetuning now!
# token = "YOUR_HF_TOKEN", # HF Token for gated models
)
# # Gemma 4 can process Text, Vision and Audio!
#
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
# In[4]:
from transformers import TextStreamer
# Helper function for inference
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **processor.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        do_sample = False,
        streamer = TextStreamer(processor, skip_prompt = True),
    )
# <h3>Let's Evaluate Gemma 4 Baseline Performance on German Transcription</h3>
# In[5]:
from datasets import load_dataset,Audio,concatenate_datasets
dataset = load_dataset("kadirnar/Emilia-DE-B000000", split = "train")
# Select a single audio sample to reserve for testing.
# This index is chosen from the full dataset before we create the smaller training split.
test_audio = dataset[7546]
dataset = dataset.select(range(3000))
dataset = dataset.cast_column("audio", Audio(sampling_rate = 16000))
# In[6]:
from IPython.display import Audio, display
print(test_audio['text'])
Audio(test_audio['audio']['array'],rate = test_audio['audio']['sampling_rate'])
# And the translation of the audio from German to English is:
#
# > I—I hold myself directly accountable. That much is, of course, clear: namely, that there are political interests involved in trade—in the exchange of goods—and that political influences are at play. The question is: that should not be the alternative.
# In[7]:
messages = [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an assistant that transcribes speech accurately.",
}
],
},
{
"role": "user",
"content": [
{"type": "audio", "audio": test_audio['audio']['array']},
{"type": "text", "text": "Please transcribe this audio."}
]
}
]
do_gemma_4_inference(messages, max_new_tokens = 256)
# <h3>Baseline Model Performance: 32.43% Word Error Rate (WER) for this sample!</h3>
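# For readers who want to reproduce a word error rate like the one quoted above, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a plain standard-library version. It is not necessarily how the figure above was computed: it assumes whitespace tokenization with no case or punctuation normalization, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / max(1, len(ref))


# Toy strings: one substituted word and one dropped word out of four.
print(word_error_rate("das ist ein Test", "das war ein"))
```

# In practice, lowercasing and stripping punctuation before scoring usually changes the number noticeably, so state the normalization alongside any WER you report.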
# # Let's finetune Gemma 4!
#
# You can finetune the vision and text and audio parts
# We now add LoRA adapters so we only need to update a small amount of parameters!
# In[8]:
model = FastModel.get_peft_model(
model,
finetune_vision_layers = False, # False if not finetuning vision layers
finetune_language_layers = True, # False if not finetuning language layers
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules = True, # False if not finetuning MLP layers
r = 8, # The larger, the higher the accuracy, but might overfit
lora_alpha = 16, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
# Audio layers
"post", "linear_start", "linear_end",
"embedding_projection",
"ffw_layer_1", "ffw_layer_2",
"output_proj",
]
)
# <a name="Data"></a>
# ### Data Prep
# We adapt the `kadirnar/Emilia-DE-B000000` dataset for our German ASR task using Gemma 4 multi-modal chat format. Each audio-text pair is structured into a conversation with `system`, `user`, and `assistant` roles. The processor then converts this into the final training format:
#
# ```
# <bos><|turn>system
# You are an assistant that transcribes speech accurately.<turn|>
# <|turn>user
# <|audio|>Please transcribe this audio.<turn|>
# <|turn>model
# Ich, ich rechne direkt mich an.<turn|>
# ```
# In[9]:
def format_intersection_data(samples: dict) -> dict[str, list]:
    """Format intersection dataset to match expected message format"""
    formatted_samples = {"messages": []}
    for idx in range(len(samples["audio"])):
        audio = samples["audio"][idx]["array"]
        label = str(samples["text"][idx])
        message = [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are an assistant that transcribes speech accurately.",
                    }
                ],
            },
            {
                "role": "user",
                "content": [
                    {"type": "audio", "audio": audio},
                    {"type": "text", "text": "Please transcribe this audio."}
                ]
            },
            {
                "role": "assistant",
                "content":[{"type": "text", "text": label}]
            }
        ]
        formatted_samples["messages"].append(message)
    return formatted_samples
# In[10]:
dataset = dataset.map(format_intersection_data, batched = True, batch_size = 4, num_proc = 4)
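# With `batched = True`, the mapped function receives a dict of columns (one list per column) and must return a dict of columns. The sketch below imitates that contract with a plain dict so it runs without downloading anything. The batch contents are invented, the helper name `to_messages` is hypothetical, and the system turn is omitted for brevity.

```python
def to_messages(batch: dict) -> dict:
    """Same shape contract as a batched map function: columns in, columns out."""
    messages = []
    for audio, text in zip(batch["audio"], batch["text"]):
        messages.append([
            {"role": "user", "content": [
                {"type": "audio", "audio": audio["array"]},
                {"type": "text", "text": "Please transcribe this audio."},
            ]},
            {"role": "assistant", "content": [{"type": "text", "text": str(text)}]},
        ])
    return {"messages": messages}


# Made-up batch of two rows; real rows hold decoded waveforms.
fake_batch = {
    "audio": [{"array": [0.0, 0.1]}, {"array": [0.2, -0.1]}],
    "text": ["Guten Tag", "Danke"],
}
out = to_messages(fake_batch)
print(len(out["messages"]), "conversations built")
```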
# <a name="Train"></a>
# ### Train the model
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and disable the step cap with `max_steps=None`.
# In[11]:
# Use UnslothVisionDataCollator which handles audio token alignment correctly
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
processing_class = processor.tokenizer,
data_collator = UnslothVisionDataCollator(model, processor),
args = SFTConfig(
per_device_train_batch_size = 8,
gradient_accumulation_steps = 1,
warmup_ratio = 0.03,
# num_train_epochs = 1, # Use for full training runs
max_steps = 60,
learning_rate = 5e-5,
logging_steps = 1,
save_strategy = "steps",
optim = "adamw_8bit",
weight_decay = 0.001,
lr_scheduler_type = "cosine",
seed = 3407,
output_dir = "outputs",
report_to = "none",
remove_unused_columns = False,
# The below are a must for audio finetuning:
dataset_text_field = "",
dataset_kwargs = {"skip_prepare_dataset": True},
max_length = 8192,
)
)
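# The config above combines `warmup_ratio = 0.03` with `lr_scheduler_type = "cosine"`. A minimal sketch of the resulting learning-rate curve, assuming linear warmup followed by a half-cosine decay to zero and a warmup length rounded up to whole steps. The function name `cosine_lr` is made up, and the real scheduler may round or clamp slightly differently.

```python
import math


def cosine_lr(step: int, max_steps: int = 60, base_lr: float = 5e-5,
              warmup_ratio: float = 0.03) -> float:
    """Linear warmup followed by a half-cosine decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1, 2, 31, 60):
    print(step, f"{cosine_lr(step):.2e}")
```

# With only 60 steps the warmup lasts just a couple of updates, so almost the whole run is spent on the decay.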
# In[12]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# # Let's train the model!
#
# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
# In[13]:
trainer_stats = trainer.train()
# In[14]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
# <a name="Inference"></a>
# ### Inference
# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
# In[15]:
messages = [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an assistant that transcribes speech accurately.",
}
],
},
{
"role": "user",
"content": [
{"type": "audio", "audio": test_audio['audio']['array']},
{"type": "text", "text": "Please transcribe this audio."}
]
}
]
do_gemma_4_inference(messages, max_new_tokens = 256)
# <a name="Save"></a>
# ### Saving, loading finetuned models
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
#
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
# In[16]:
model.save_pretrained("gemma_4_lora") # Local saving
processor.save_pretrained("gemma_4_lora")
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# processor.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
# In[17]:
if False:
    from unsloth import FastModel
    model, processor = FastModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )
messages = [{
"role": "user",
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
}]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
from transformers import TextStreamer
_ = model.generate(
**inputs,
max_new_tokens = 128, # Increase for longer outputs!
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
streamer = TextStreamer(processor, skip_prompt = True),
)
# ### Saving to float16 for VLLM
#
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
# In[18]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-4-finetune", processor)
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[19]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", processor,
        token = "YOUR_HF_TOKEN"
    )
# ### GGUF / llama.cpp Conversion
# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
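# To compare the three precisions listed above, here is a rough size estimate. It assumes the usual GGUF `Q8_0` layout (blocks of 32 int8 weights sharing one 16-bit scale, so 8.5 bits per weight) and ignores file metadata and any tensors kept at higher precision. The parameter count is a placeholder, not the real size of any Gemma 4 checkpoint.

```python
Q8_0_BLOCK_WEIGHTS = 32      # weights per quantization block
Q8_0_BLOCK_BYTES = 2 + 32    # one 16-bit scale plus 32 int8 values

BITS_PER_WEIGHT = {
    "F16": 16.0,
    "BF16": 16.0,
    "Q8_0": Q8_0_BLOCK_BYTES * 8 / Q8_0_BLOCK_WEIGHTS,
}


def estimate_gib(n_params: float, method: str) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return n_params * BITS_PER_WEIGHT[method] / 8 / 1024**3


n_params = 5e9  # hypothetical parameter count, for illustration only
for method in ("F16", "BF16", "Q8_0"):
    print(f"{method}: ~{estimate_gib(n_params, method):.1f} GiB")
```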
# In[20]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        processor,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
# Likewise, if you want to instead push to GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[21]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        processor,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
# Now, use the exported `.gguf` file (for example, the `Q8_0` export) in llama.cpp.
#
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,556 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# ### News
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
#
# <table><tr>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
# </tr></table>
#
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
#
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
#
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
#
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
# # ### Installation
#
# # In[1]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
#
#
# # In[2]:
#
#
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
#
#
# # ### Unsloth
#
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
# In[3]:
from unsloth import FastModel
import torch
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/gemma-4-E2B-it",
dtype = None, # None for auto detection
max_seq_length = 1024, # Choose any for long context!
load_in_4bit = False, # 4 bit quantization to reduce memory
full_finetuning = False, # [NEW!] We have full finetuning now!
# token = "YOUR_HF_TOKEN", # HF Token for gated models
)
# # Gemma 4 can process Text, Vision and Audio!
#
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
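# For intuition about what `temperature`, `top_k` and `top_p` do, here is a small pure-Python sketch of the filtering applied to one step's logits. It assumes the common order (temperature, then top-k, then top-p on the renormalized survivors). It is not the code path `model.generate` actually runs, and the function names are made up for this example.

```python
import math
import random


def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_id: prob} after temperature, top-k, then top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    ranked = ranked[:top_k]  # keep the k most likely tokens
    k_mass = sum(p for p, _ in ranked)
    kept, cumulative = [], 0.0
    for prob, token_id in ranked:
        kept.append((prob, token_id))
        cumulative += prob / k_mass
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break
    mass = sum(p for p, _ in kept)
    return {token_id: p / mass for p, token_id in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    probs = filter_probs(logits, **kwargs)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # invented scores for a 4-token vocabulary
print(filter_probs(toy_logits, top_k=2, top_p=1.0))
print(sample(toy_logits))
```

# Lower `top_p` or `top_k` values trim the tail of unlikely tokens, while `temperature` below 1.0 sharpens the distribution before any trimming happens.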
# In[4]:
from transformers import TextStreamer
# Helper function for inference
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True)
    )
# # Gemma 4 can see images!
#
# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
# In[5]:
sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
messages = [{
"role" : "user",
"content": [
{ "type": "image", "image" : sloth_link },
{ "type": "text", "text" : "Which films does this animal feature in?" }
]
}]
# You might have to wait 1 minute for Unsloth's auto compiler
do_gemma_4_inference(messages, max_new_tokens = 256)
# Let's make a poem about sloths!
# In[6]:
messages = [{
"role": "user",
"content": [{ "type" : "text",
"text" : "Write a poem about sloths." }]
}]
do_gemma_4_inference(messages)
# # Gemma 4 can also hear!
# In[7]:
from IPython.display import Audio, display
Audio("https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3")
# In[8]:
get_ipython().system('wget -qqq https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3 -O audio.mp3')
# In[9]:
audio_file = "audio.mp3"
messages = [{
"role" : "user",
"content": [
{ "type": "audio", "audio" : audio_file },
{ "type": "text", "text" : "What is this audio about?" }
]
}]
do_gemma_4_inference(messages, max_new_tokens = 256)
# # Let's combine all 3 modalities together!
# In[10]:
messages = [{
"role" : "user",
"content": [
{ "type": "audio", "audio" : audio_file },
{ "type": "image", "image" : sloth_link },
{ "type": "text", "text" : "What is this audio and image about? "\
"How are they related?" }
]
}]
do_gemma_4_inference(messages, max_new_tokens = 256)
# # Let's finetune Gemma 4!
#
# For now you can choose whether to finetune the vision and text parts. The audio part can also be finetuned, and we're working to make it selectable as well!
# We now add LoRA adapters so we only need to update a small number of parameters!
# In[11]:
model = FastModel.get_peft_model(
model,
finetune_vision_layers = False, # Turn off for just text!
finetune_language_layers = True, # Should leave on!
finetune_attention_modules = True, # Attention good for GRPO
finetune_mlp_modules = True, # Should leave on always!
r = 8, # Larger = higher accuracy, but might overfit
lora_alpha = 8, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
)
# <a name="Data"></a>
# ### Data Prep
# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
#
# ```
# <bos><|turn>user
# Hello<turn|>
# <|turn>model
# Hey there!<turn|>
# ```
# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
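# To make the turn format above concrete, here is a hand-rolled, text-only approximation of how a conversation is laid out. It is a sketch only: the real chat template also handles system prompts, images, audio and whitespace details, and `render_gemma4_text` is a hypothetical helper, not an Unsloth or transformers function.

```python
def render_gemma4_text(messages, add_generation_prompt=False):
    """Text-only approximation of the <|turn> ... <turn|> layout."""
    role_names = {"assistant": "model"}  # the format calls the assistant "model"
    out = ["<bos>"]
    for message in messages:
        role = role_names.get(message["role"], message["role"])
        out.append(f"<|turn>{role}\n{message['content']}<turn|>\n")
    if add_generation_prompt:
        out.append("<|turn>model\n")  # leave the model's turn open
    return "".join(out)


chat = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma4_text(chat))
```

# For training we render complete turns; for inference, `add_generation_prompt` leaves the final `model` turn open so the model writes the reply.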
# In[12]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "gemma-4",
)
# We get the first 3000 rows of the dataset
# In[13]:
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
# In[14]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)
# Let's see what row 100 looks like!
# In[15]:
dataset[100]
# We now have to apply the chat template for `Gemma-4` to the conversations and save the result to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The processor will add this token before training, and the model expects only one.
# In[16]:
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
    return { "text" : texts, }
dataset = dataset.map(formatting_prompts_func, batched = True)
# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
# In[17]:
dataset[100]["text"]
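# Why strip `<bos>` at all? A tiny string-level sketch of the double-BOS problem, assuming the tokenizer prepends one `<bos>` of its own when the text is tokenized later. The rendered string is invented, and `strip_leading_bos` is an illustrative helper.

```python
BOS = "<bos>"


def strip_leading_bos(text: str) -> str:
    """Drop one leading <bos> so later tokenization does not produce two."""
    return text.removeprefix(BOS)  # str.removeprefix needs Python 3.9+


rendered = "<bos><|turn>user\nHello<turn|>\n"
without_fix = BOS + rendered                  # tokenizer prepends BOS again
with_fix = BOS + strip_leading_bos(rendered)  # only one BOS survives
print(without_fix.count(BOS), with_fix.count(BOS))
```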
# <a name="Train"></a>
# ### Train the model
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and disable the step cap with `max_steps=None`.
# In[18]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
eval_dataset = None, # Can set up evaluation!
args = SFTConfig(
dataset_text_field = "text",
per_device_train_batch_size = 1,
gradient_accumulation_steps = 4, # Use GA to mimic batch size!
warmup_steps = 5,
# num_train_epochs = 1, # Set this for 1 full training run.
max_steps = 60,
learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.001,
lr_scheduler_type = "linear",
seed = 3407,
report_to = "none", # Use TrackIO/WandB etc
),
)
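# A quick sanity check on what these settings mean in practice: gradient accumulation multiplies the per-device batch size, and `max_steps` then caps how many examples are seen. The helper `training_budget` is illustrative and assumes a single GPU unless told otherwise.

```python
def training_budget(per_device_batch, grad_accum, max_steps, dataset_rows, num_devices=1):
    """Summarize how much data a step-capped run actually consumes."""
    effective_batch = per_device_batch * grad_accum * num_devices
    examples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "examples_seen": examples_seen,
        "epochs": examples_seen / dataset_rows,
    }


# Values taken from the config above: batch 1, accumulation 4, 60 steps, 3000 rows.
budget = training_budget(1, 4, 60, 3000)
print(budget)
```

# So the 60-step demo only touches a small fraction of the 3000 sampled rows, which is why a full run switches to `num_train_epochs`.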
# We also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!
# In[19]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
trainer,
instruction_part = "<|turn>user\n",
response_part = "<|turn>model\n",
)
# Let's verify masking the instruction part is done! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
# In[20]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])
# Now let's print the masked out example - you should see only the answer is present:
# In[21]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
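# The masking idea can be sketched without a tokenizer. Below, tokens are plain strings, and every position outside a model turn gets the label `-100`, the value the loss ignores. This is only an illustration of the concept under those assumptions; Unsloth's `train_on_responses_only` works on token ids and may treat edge cases differently, and `mask_non_responses` is a made-up name.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_non_responses(tokens, instruction_part, response_part):
    """Return labels where only tokens inside response turns keep their value."""
    if not instruction_part or not response_part:
        raise ValueError("both markers must be non-empty")
    labels, in_response, i = [], False, 0
    while i < len(tokens):
        if tokens[i:i + len(response_part)] == response_part:
            labels += [IGNORE_INDEX] * len(response_part)  # marker is not a target
            in_response, i = True, i + len(response_part)
        elif tokens[i:i + len(instruction_part)] == instruction_part:
            labels += [IGNORE_INDEX] * len(instruction_part)
            in_response, i = False, i + len(instruction_part)
        else:
            labels.append(tokens[i] if in_response else IGNORE_INDEX)
            i += 1
    return labels


toy_tokens = ["<bos>", "<|turn>", "user", "\n", "Hello", "<turn|>", "\n",
              "<|turn>", "model", "\n", "Hey", "there!", "<turn|>", "\n"]
toy_labels = mask_non_responses(
    toy_tokens,
    instruction_part=["<|turn>", "user", "\n"],
    response_part=["<|turn>", "model", "\n"],
)
print(toy_labels)
```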
# In[22]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# # Let's train the model!
#
# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
# In[23]:
trainer_stats = trainer.train()
# In[24]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
# <a name="Inference"></a>
# ### Inference
# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
# In[25]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "gemma-4",
)
messages = [{
"role": "user",
"content": [{
"type" : "text",
"text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens = 64, # Increase for longer outputs!
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)
# You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
# In[26]:
messages = [{
"role": "user",
"content": [{"type" : "text", "text" : "Why is the sky blue?",}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
from transformers import TextStreamer
_ = model.generate(
**inputs,
max_new_tokens = 64, # Increase for longer outputs!
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
streamer = TextStreamer(tokenizer, skip_prompt = True),
)
# <a name="Save"></a>
# ### Saving, loading finetuned models
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
#
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
# In[27]:
model.save_pretrained("gemma_4_lora") # Local saving
tokenizer.save_pretrained("gemma_4_lora")
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
# In[28]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )
messages = [{
"role": "user",
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
from transformers import TextStreamer
_ = model.generate(
**inputs,
max_new_tokens = 128, # Increase for longer outputs!
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
streamer = TextStreamer(tokenizer, skip_prompt = True),
)
# ### Saving to float16 for VLLM
#
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
# In[29]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[30]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
        token = "YOUR_HF_TOKEN"
    )
# ### GGUF / llama.cpp Conversion
# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
# In[31]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
# Likewise, if you want to instead push to GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[32]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
# Now, use the exported `.gguf` file (for example, the `Q8_0` export) in llama.cpp.
#
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# # ### Installation
#
# # In[ ]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
#
#
# # In[ ]:
#
#
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
#
#
# # ### Unsloth
# In[ ]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, processor = FastVisionModel.from_pretrained(
"unsloth/gemma-4-E2B-it",
load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)
# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
#
# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
# In[ ]:
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers = True, # False if not finetuning vision layers
finetune_language_layers = True, # False if not finetuning language layers
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules = True, # False if not finetuning MLP layers
r = 32, # The larger, the higher the accuracy, but might overfit
lora_alpha = 32, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
target_modules = "all-linear", # Optional now! Can specify a list if needed
)
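# To make the "train only about 1% of the parameters" idea concrete, here is a rough sketch of the LoRA parameter arithmetic for a single linear layer. The 2048 x 2048 layer shape is a made-up illustration, not a Gemma 4 dimension, and `lora_param_count` is a hypothetical helper, not part of Unsloth or PEFT.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in).

    LoRA learns two low-rank factors: A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * d_in + d_out * r

# Hypothetical layer shape, chosen only for illustration.
d_in, d_out, r = 2048, 2048, 32
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"Full layer: {full:,} params, LoRA adds {added:,} ({added / full:.2%})")
```

# The share for a whole model is usually smaller than for one layer, since embeddings and any modules not targeted by LoRA contribute no trainable parameters.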
# <a name="Data"></a>
# ### Data Prep
# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
#
# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
# In[ ]:
from datasets import load_dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
# Let's take an overview of the dataset. We'll examine the third image (index 2) and its corresponding caption.
# In[ ]:
dataset
# In[ ]:
dataset[2]["image"]
# In[ ]:
dataset[2]["text"]
# We can also render LaTeX directly in the browser!
# In[ ]:
from IPython.display import display, Math, Latex
latex = dataset[3]["text"]
display(Math(latex))
# To format the dataset, all vision fine-tuning tasks should follow this format:
#
# ```python
# [
#     {
#         "role": "user",
#         "content": [
#             {"type": "text", "text": instruction},
#             {"type": "image", "image": sample["image"]},
#         ],
#     },
#     {
#         "role": "assistant",
#         "content": [
#             {"type": "text", "text": sample["text"]},
#         ],
#     },
# ]
# ```
# In[ ]:
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]
    return {"messages": conversation}
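# As a sanity check on the training format described above, the sketch below validates that a sample has one user turn followed by one assistant turn, and that every content part carries its payload. `check_conversation` is a hypothetical helper written for this illustration, and a plain `object()` stands in for the PIL image so the snippet runs without the dataset.

```python
def check_conversation(messages):
    """Return a list of problems found in a user/assistant training sample."""
    problems = []
    expected_roles = ["user", "assistant"]
    roles = [m.get("role") for m in messages]
    if roles != expected_roles:
        problems.append(f"expected roles {expected_roles}, got {roles}")
    for m in messages:
        for part in m.get("content", []):
            kind = part.get("type")
            if kind not in ("text", "image"):
                problems.append(f"unknown content type: {kind!r}")
            elif kind not in part:
                problems.append(f"{kind!r} part is missing its {kind!r} payload")
    return problems

# Stand-in sample: a placeholder object instead of a real image.
fake_sample = {"image": object(), "text": r"\frac{a}{b}"}
good = [
    {"role": "user", "content": [
        {"type": "text", "text": "Write the LaTeX representation for this image."},
        {"type": "image", "image": fake_sample["image"]},
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": fake_sample["text"]}]},
]
bad = [good[0], dict(good[0])]  # two user turns and no assistant answer

print(check_conversation(good))
print(check_conversation(bad))
```

# This check targets training samples only. Inference prompts in this notebook pass `{"type": "image"}` without a payload and supply the image to the processor separately.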
# Let's convert the dataset into the "correct" format for finetuning:
# In[ ]:
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
# The first example is now structured like below:
# In[ ]:
converted_dataset[0]
# Let's take the Gemma 4 instruction chat template and apply it to our model.
# In[ ]:
from unsloth import get_chat_template
processor = get_chat_template(
processor,
"gemma-4"
)
# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
# In[ ]:
image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."
messages = [
{
"role": "user",
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# You can see it's absolutely terrible! It doesn't follow instructions at all
# <a name="Train"></a>
# ### Train the model
# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step cap with `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
#
# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
# In[ ]:
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
train_dataset = converted_dataset,
processing_class = processor.tokenizer,
data_collator = UnslothVisionDataCollator(model, processor),
args = SFTConfig(
per_device_train_batch_size = 1,
gradient_accumulation_steps = 4,
max_grad_norm = 0.3,
warmup_ratio = 0.03,
max_steps = 60,
# num_train_epochs = 2, # Set this instead of max_steps for full training runs
learning_rate = 2e-4,
logging_steps = 1,
save_strategy = "steps",
optim = "adamw_8bit",
weight_decay = 0.001,
lr_scheduler_type = "cosine",
seed = 3407,
output_dir = "outputs",
report_to = "none", # For Weights and Biases or others
# You MUST put the below items for vision finetuning:
remove_unused_columns = False,
dataset_text_field = "",
dataset_kwargs = {"skip_prepare_dataset": True},
max_length = 2048,
)
)
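# The batch settings above determine how much data a 60-step run actually sees. The arithmetic below is a small sketch, assuming a single GPU (`num_devices = 1` is our assumption, not a value from the config).

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 60
num_devices = 1  # assumption: a single GPU, as on a free Colab instance

# One optimizer step consumes this many samples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total samples consumed before training stops at max_steps.
samples_seen = effective_batch * max_steps

print(f"Effective batch size: {effective_batch}")
print(f"Samples consumed in {max_steps} optimizer steps: {samples_seen}")
```

# If the dataset is larger than `samples_seen`, the short run covers only part of it, which is why a full run switches to `num_train_epochs`.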
# In[ ]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# In[ ]:
trainer_stats = trainer.train()
# In[ ]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
# <a name="Inference"></a>
# ### Inference
# Let's run the model! You can modify the instruction and input—just leave the output blank.
#
# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
# In[ ]:
image = dataset[10]["image"]
instruction = "Write the LaTeX representation for this image."
messages = [
{
"role": "user",
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# <a name="Save"></a>
# ### Saving, loading finetuned models
# To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for online saving, or `save_pretrained` for local storage.
#
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
# In[ ]:
model.save_pretrained("gemma_4_lora") # Local saving
processor.save_pretrained("gemma_4_lora")
# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:
# In[ ]:
if False:
    from unsloth import FastVisionModel
    model, processor = FastVisionModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = True, # Set to False for 16bit LoRA
    )
sample = dataset[1]
image = sample["image"].convert("RGB")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": instruction, # Prompt with the task instruction, not the ground-truth LaTeX
},
{
"type": "image",
},
],
},
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# ### Saving to float16 for vLLM
#
# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
# In[ ]:
# Select ONLY 1 to save! (Both not needed!)
# Save locally to 16bit
if False: model.save_pretrained_merged("unsloth_finetune", processor,)
# To export and save to your Hugging Face account
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,911 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# # ### Installation
#
# # In[ ]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
#
#
# # In[ ]:
#
#
# #@title Colab Extra Install { display-mode: "form" }
# get_ipython().run_line_magic('%capture', '')
# import os
# get_ipython().system('pip install --upgrade -qqq uv')
# if "COLAB_" not in "".join(os.environ.keys()):
# # If you're not in Colab, just use pip install!
# get_ipython().system('pip install unsloth vllm')
# else:
# try: import numpy, PIL; _numpy = f'numpy=={numpy.__version__}'; _pil = f'pillow=={PIL.__version__}'
# except: _numpy = "numpy"; _pil = "pillow"
# try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
# except: is_t4 = False
# _vllm, _triton = ('vllm==0.9.2', 'triton==3.2.0') if is_t4 else ('vllm==0.15.1', 'triton')
# get_ipython().system('uv pip install -qqq --upgrade {_vllm} {_numpy} {_pil} torchvision bitsandbytes xformers unsloth')
# get_ipython().system('uv pip install -qqq {_triton}')
# get_ipython().system('uv pip install transformers==4.56.2')
# get_ipython().system('uv pip install --no-deps trl==0.22.2')
#
#
# # ### Unsloth
# # Goal: Make faster kernels with Reinforcement Learning
#
# Our goal is to make a faster matrix multiplication kernel by doing RL on Gemma 4 with Unsloth.
#
# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Matrix_multiplication_qtl1.svg/500px-Matrix_multiplication_qtl1.svg.png" height=200 />
#
# You will learn how to:
# 1. Counteract **reward hacking** such as cheating, caching, and laziness.
# 2. Measure the timing and correctness of kernels under time limits.
# 3. Design good **reward functions**.
# 4. Apply RL seriously to produce optimized kernels.
# In[ ]:
from unsloth import FastVisionModel
import torch
max_seq_length = 4096 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastVisionModel.from_pretrained(
model_name = "unsloth/gemma-4-E2B-it",
max_seq_length = max_seq_length,
load_in_4bit = False, # False for LoRA 16bit
fast_inference = False, # Set to True to enable vLLM fast inference
)
# We now add some small amount of LoRA weights to Gemma 4 so we only need to train those, instead of training on the full model.
# In[ ]:
model = FastVisionModel.get_peft_model(
model,
r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha = lora_rank*2, # *2 speeds up training
use_gradient_checkpointing = "unsloth", # Reduces memory usage
random_state = 3407,
)
# # Optimized matrix multiplication
#
# Numpy has optimized matrix multiplication kernels for CPUs via BLAS optimized operations. For GPUs, one can use CUDA accelerated cuBLAS kernels which PyTorch calls under the hood.
#
# To generate some random matrices to do matrix multiplication, we can do the below:
# In[ ]:
import numpy as np
def generate_random_matrices(seed = 3407, n = 256):
    random_state = np.random.RandomState(seed)
    n, k, m = random_state.randint(1, n+1, size = 3)
    # Draw from the seeded generator so the same seed reproduces the same matrices
    A = random_state.uniform(-10, 10, size = (n, k))
    B = random_state.uniform(-10, 10, size = (k, m))
    return A, A.tolist(), B, B.tolist()
# We shall generate a small matrix, and see the matrix multiplied output
# In[ ]:
A, A_list, B, B_list = generate_random_matrices(seed = 42, n = 5)
print(A)
print(B)
print(np.matmul(A, B))
# We can call an LLM to generate a simple matrix multiply kernel in pure Python, and then measure the difference between the reference result and the kernel's result.
# In[ ]:
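def calculate_difference(pred, real):
    if pred is None: return 5, 5
    assert real is not None
    import numpy as np
    try:
        difference = pred - real
    except Exception:
        # Shape mismatch or non-numeric output: treat it as a large error
        return 5, 5
    # Use the absolute value so large negative errors are not missed
    amax_error = float(np.amax(np.abs(difference)))
    mse_error = float(np.mean(np.square(difference)))
    return amax_error, mse_error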
# In[ ]:
# Kernel generated by GPT-5
def matmul(A, B):
    z, s = zip, sum
    Bt = list(z(*B))
    return [[s(a*b for a, b in z(row, col)) for col in Bt] for row in A]
# We see the error below is very small, so that's good!
# In[ ]:
prediction = matmul(A_list, B_list)
calculate_difference(prediction, np.matmul(A, B))
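# The two error metrics can also be illustrated without NumPy. The sketch below uses a naive triple-loop product as ground truth and shows why the maximum error should be taken over absolute values: a signed maximum reports no error when the only mistake is negative. `matmul_reference` and `error_metrics` are hypothetical names used only for this illustration.

```python
def matmul_reference(A, B):
    """Naive triple-loop product, used here as the ground truth."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)] for i in range(n)]

def error_metrics(pred, real):
    """Return (max absolute error, mean squared error) over all entries."""
    diffs = [p - r for prow, rrow in zip(pred, real) for p, r in zip(prow, rrow)]
    max_abs = max(abs(d) for d in diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return max_abs, mse

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
real = matmul_reference(A, B)

# A deliberately wrong result whose only error is negative.
wrong = [[real[0][0] - 2.0, real[0][1]], [real[1][0], real[1][1]]]
signed_max = max(p - r for prow, rrow in zip(wrong, real) for p, r in zip(prow, rrow))

print(error_metrics(real, real))   # a correct kernel has zero error
print(error_metrics(wrong, real))  # the absolute maximum catches the -2.0 error
print(signed_max)                  # the signed maximum does not
```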
# # Countering Reward Hacking
#
# The ultimate goal of RL is to maximize some reward (say speed, revenue, some metric).
#
# But RL can **cheat**. When the RL algorithm learns a trick or exploits something to increase the reward without actually doing the task at hand, this is called "Reward Hacking".
#
# Some good examples are in https://en.wikipedia.org/wiki/Reward_hacking
#
# For matrix multiplication kernels, we might see the following issues:
#
# * Laziness: RL learns to use Numpy, Torch, other libraries, which calls optimized kernels.
# * Caching: RL learns to cache the result of the output
# * Cheating: RL learns to find the actual output by inspecting Python global variables
# * Timer tampering: RL learns to edit the timing function so that it reports zero elapsed time.
#
# And possibly more. We shall try to address each!
# # Countering Reward Hacking 1: Stop laziness
# We can stop the RL algorithm from calling optimized code by checking whether the generated code imports any non-standard Python libraries. We used GPT-5 to help generate this check, `check_only_stdlib_imports`:
# In[ ]:
#@title (Collapsible code)
import ast
import sys
import sysconfig
from pathlib import Path
def _stdlib_names():
    """
    Build a set of canonical stdlib top-level module/package names.
    Uses sys.stdlib_module_names when available (3.10+), with a
    filesystem fallback for older versions/edge cases.
    """
    names = {m.lower() for m in getattr(sys, "stdlib_module_names", set())}
    names |= {m.lower() for m in sys.builtin_module_names}
    names.add("__future__") # special-case
    # Fallback/augmentation: scan the stdlib directory
    try:
        stdlib_dir = Path(sysconfig.get_path("stdlib"))
        if stdlib_dir.exists():
            for p in stdlib_dir.iterdir():
                if p.name == "site-packages":
                    continue
                if p.suffix == ".py":
                    names.add(p.stem.lower())
                elif p.is_dir() and (p / "__init__.py").exists():
                    names.add(p.name.lower())
    except Exception:
        # conservative fallback; the names set above will still work well
        pass
    return names
_STDLIB_SET = _stdlib_names()
def check_only_stdlib_imports(code: str):
    """
    Return (ok: bool, details: dict)
    ok == True -> all absolute imports are from the stdlib.
    ok == False -> details['non_stdlib'] lists offending top-level modules.
    details includes:
    - stdlib: sorted list of stdlib imports found
    - non_stdlib: sorted list of non-stdlib imports found
    - relative_imports: count of relative imports (always allowed here)
    """
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return False, {
            "error": f"SyntaxError: {e}",
            "stdlib": [],
            "non_stdlib": [],
            "relative_imports": 0,
        }
    abs_imports = set()
    relative_count = 0
    class Visitor(ast.NodeVisitor):
        def visit_Import(self, node: ast.Import):
            for alias in node.names:
                abs_imports.add(alias.name.split(".")[0])
        def visit_ImportFrom(self, node: ast.ImportFrom):
            nonlocal relative_count
            if (node.level or 0) > 0:
                # relative import
                relative_count += 1
            else:
                if node.module:
                    abs_imports.add(node.module.split(".")[0])
    Visitor().visit(tree)
    stdlib_found = sorted(m for m in abs_imports if m.lower() in _STDLIB_SET)
    non_stdlib = sorted(m for m in abs_imports if m.lower() not in _STDLIB_SET)
    return len(non_stdlib) == 0, {
        "stdlib": stdlib_found,
        "non_stdlib": non_stdlib,
        "relative_imports": relative_count,
    }
# For example, let's call `check_only_stdlib_imports` on a random piece of matrix multiplication code generated by GPT-5:
# In[ ]:
sample = """
def matmul(A, B):
    import numpy as np
    from torch import matmul
    z, s = zip, sum
    Bt = list(z(*B))
    return [[s(a*b for a, b in z(row, col)) for col in Bt] for row in A]
"""
ok, info = check_only_stdlib_imports(sample)
print("Only stdlib imports?", ok)
print(info)
# # Countering Reward Hacking 2: Stop cheating
# We can stop the RL algorithm from using global or cached variables by restricting its `locals` and `globals`.
#
# We are also going to use `exec` to create the function, so we have to save the output to an empty dict.
#
# We also disallow global variable access.
# In[ ]:
output_function = {}
exec(sample, {}, output_function)
output_function["matmul"]
# We also disallow global variable access via `types.FunctionType(f.__code__, {})`
# In[ ]:
import types
output_function["matmul"] = types.FunctionType(output_function["matmul"].__code__, {})
def import_numpy():
    np.matmul
    print("Success")
import_numpy()
import_numpy = types.FunctionType(import_numpy.__code__, {})
try:
    import_numpy()
except Exception as e:
    print(str(e))
# In[ ]:
def create_locked_down_function(function):
    output_function = {}
    exec(function, {}, output_function)
    new_matmul = output_function["matmul"]
    new_matmul = types.FunctionType(new_matmul.__code__, {})
    return new_matmul
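# A minimal, standard-library-only sketch of the same lock-down idea: a kernel that tries to read a module-level variable fails with `NameError`, while a kernel that only uses its arguments still runs. `CACHED_ANSWER`, `lock_down`, and both source strings are hypothetical names invented for this illustration.

```python
import types

CACHED_ANSWER = [[19, 22], [43, 50]]  # hypothetical module-level value a kernel might try to read

cheating_source = """
def matmul(A, B):
    return CACHED_ANSWER
"""

honest_source = """
def matmul(A, B):
    return [[A[0][0] * B[0][0]]]
"""

def lock_down(source):
    namespace = {}
    exec(source, {}, namespace)  # define the function without exposing our globals
    fn = namespace["matmul"]
    return types.FunctionType(fn.__code__, {})  # rebuild it with an empty globals dict

try:
    lock_down(cheating_source)([[1]], [[2]])
    outcome = "cheat succeeded"
except NameError as e:
    outcome = f"blocked: {e}"

print(outcome)
print(lock_down(honest_source)([[3]], [[4]]))
```

# This only hides global names. It is not a security sandbox: generated code can still import modules inside the function body, which is why the import check above is applied as well.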
# # Countering Reward Hacking 3: Stop caching
# We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix. We also have to benchmark carefully with multiple loops and turns.
#
# We also add a **timer** so that a generated kernel cannot run in an endless loop.
# In[ ]:
import os, gc, time, statistics
import signal
from contextlib import contextmanager
class TimeoutError(Exception): pass
@contextmanager
def time_limit(seconds):
    def _handler(signum, frame):
        raise TimeoutError(f"Timed out after {seconds}s")
    old = signal.signal(signal.SIGALRM, _handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0.0)
        signal.signal(signal.SIGALRM, old)
class Benchmarker:
    def __init__(self, trials = 3, loops = 1, timeout = 30):
        self.buffer = np.zeros(2 * 1024 * 1024 * 1024, dtype = np.uint8)
        self.trials = trials
        self.loops = loops
        assert timeout > 0 # Cannot be 0 since it won't work!
        self.timeout = timeout
    def thrash(self):
        # Edit the buffer to wipe cache lines
        self.buffer ^= 1
        return int(self.buffer[::4096].sum())
    def benchmark(self, function, arguments):
        assert len(arguments) == self.loops
        samples = []
        exceptions = []
        timed_out = 0
        for _ in range(self.trials):
            gc.collect(); gc.disable(); self.thrash()
            t_start = time.perf_counter_ns()
            for i in range(self.loops):
                try:
                    with time_limit(self.timeout):
                        function(*arguments[i])
                except TimeoutError as e:
                    timed_out += 1
                except Exception as e:
                    exceptions.append(str(e))
            t_end = time.perf_counter_ns()
            gc.enable()
            samples.append((t_end - t_start) // max(1, self.loops))
        return {
            "median_ns": int(statistics.median(samples)),
            "mean_ns": int(statistics.fmean(samples)),
            "stdev_ns": int(statistics.pstdev(samples) if len(samples) > 1 else 0),
            "exceptions" : exceptions,
            "timeouts" : timed_out,
        }
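# The benchmark reports a median alongside the mean. The sketch below, using made-up timings, suggests why the median is the safer number to build a reward on: a single stalled trial barely moves it, while the mean and standard deviation are dominated by the outlier.

```python
import statistics

# Hypothetical per-trial timings in nanoseconds; the last trial hit an unrelated stall.
samples_ns = [100, 102, 5000]

median_ns = int(statistics.median(samples_ns))
mean_ns = int(statistics.fmean(samples_ns))
stdev_ns = int(statistics.pstdev(samples_ns))

print("median_ns:", median_ns)
print("mean_ns:  ", mean_ns)
print("stdev_ns: ", stdev_ns)
```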
# For example, we take the matmul kernel from earlier and benchmark it with a 10-second timeout:
# In[ ]:
A, A_list, B, B_list = generate_random_matrices(seed = 0, n = 256)
Benchmarker(trials = 1, timeout = 10).benchmark(output_function["matmul"], [(A_list, B_list)])
# # Data & RL task setup
#
# We now have to create a prompt to the model for which it will do some task. For our matrix multiply example, we use the below:
# In[ ]:
prompt = """
Create a new fast matrix multiplication function using only native Python code.
You are given a list of list of numbers.
Output your new function in backticks using the format below:
```python
def matmul(A, B):
    return ...
```
""".strip()
print(prompt)
# First, let's prompt Gemma 4 without RL and see how it goes:
# In[ ]:
text = tokenizer.apply_chat_template(
[{"role": "user", "content": prompt.strip()}],
tokenize = False,
add_generation_prompt = True,
)
from transformers import TextStreamer
print("=" * 50)
print("BASE MODEL OUTPUT (before RL training):")
print("=" * 50)
inputs = tokenizer(
text = text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# # Reward functions
#
# We now design the `extract_function` function which simply extracts the function wrapped in 3 backticks.
#
# And 4 reward functions:
#
# 1. `function_works` which rewards the model if the strategy is a valid Python function.
# 2. `no_cheating` which checks if the function imported other modules, and if it did, we penalize it.
# 3. `correctness_check` which checks if the kernel was correct or wrong - it shouldn't generate gibberish!
# 4. `speed_check` which measures the kernel's speed relative to NumPy's `matmul`.
# In[ ]:
def extract_function(text):
    if text.count("```") >= 2:
        first = text.find("```") + 3
        second = text.find("```", first)
        fx = text[first : second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):]
        if fx.startswith("def matmul(A, B):"): return fx
    return None
print(extract_function(prompt))
# Below is our `function_works` reward function which uses Python's `exec` but guarded by not allowing leakage of local and global variables. We can also use `check_only_stdlib_imports` first to check if there are errors before even executing the function:
# In[ ]:
ok, info = check_only_stdlib_imports("def a")
ok, info
# In[ ]:
def function_works(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        print(function)
        if function is not None:
            ok, info = check_only_stdlib_imports(function)
        if function is None or "error" in info:
            score = -2.0
        else:
            try:
                new_matmul = create_locked_down_function(function)
                score = 1.0
            except:
                score = -0.5
        scores.append(score)
    return scores
# `no_cheating` checks if the function cheated since it might have imported Numpy or Torch optimized code.
# In[ ]:
def no_cheating(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_only_stdlib_imports(function)
        else:
            ok = False
        scores.append(1.0 if ok else -20.0) # Penalize heavily!
    return scores
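# Both reward functions so far share one calling convention: they receive `completions`, a list with one entry per sampled reply, where each entry is a list of chat messages, and they return one float per entry. The sketch below exercises that shape with mock data. `contains_code_block` is a hypothetical reward written only to show the signature, and the triple backtick is built with `"`" * 3` purely to keep this snippet's own fence intact.

```python
def contains_code_block(completions, **kwargs):
    """Hypothetical reward: +1.0 if the reply contains a fenced block, else -1.0."""
    fence = "`" * 3
    scores = []
    for completion in completions:
        response = completion[0]["content"]  # the assistant message for this sample
        scores.append(1.0 if response.count(fence) >= 2 else -1.0)
    return scores

fence = "`" * 3
mock_completions = [
    [{"role": "assistant", "content": f"{fence}python\ndef matmul(A, B):\n    return []\n{fence}"}],
    [{"role": "assistant", "content": "I cannot help with that."}],
]
print(contains_code_block(mock_completions))
```

# Testing a reward function on mock completions like this is a cheap way to catch indexing mistakes before starting a training run.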
# Next, `correctness_check` checks whether the kernel was correct. We want to penalize it if the absolute error is larger than 1, and if the mean squared error is noticeably bigger than machine epsilon.
#
# We have to execute the code now!
# In[ ]:
np.finfo(np.float64).eps
# In[ ]:
def correctness_check(completions, **kwargs):
    scores = []
    # Generate some random matrices of size less than 128
    A, A_list, B, B_list = generate_random_matrices(seed = np.random.randint(10000), n = 128)
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_only_stdlib_imports(function)
        if function is None or "error" in info:
            scores.append(0)
            continue
        try:
            new_matmul = create_locked_down_function(function)
        except:
            scores.append(0)
            continue
        try:
            pred = new_matmul(A_list.copy(), B_list.copy())
        except:
            # Failed!
            scores.append(-2.0)
            continue
        true = np.matmul(A, B)
        amax_error, mse_error = calculate_difference(pred, true)
        # Check correctness and score!
        machine_epsilon = 100*np.finfo(np.float64).eps
        if amax_error >= 3: score = -3.0
        elif amax_error >= 2: score = -2.5
        elif amax_error >= 1: score = -2.0
        elif amax_error >= 0.5: score = -1.0
        elif amax_error >= 100*machine_epsilon: score = 0.0
        elif amax_error >= machine_epsilon: score = 1.0
        else: score = 3.0
        if mse_error >= 3: score += -3.0
        elif mse_error >= 2: score += -2.5
        elif mse_error >= 1: score += -2.0
        elif mse_error >= 0.5: score += -1.0
        elif mse_error >= 100*machine_epsilon: score += 0.0
        elif mse_error >= machine_epsilon: score += 1.0
        else: score += 3.0
        scores.append(score)
    return scores
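# The threshold ladder in `correctness_check` is easier to follow when isolated. The sketch below restates it as a standalone function and evaluates a few sample errors. `error_to_score` is a hypothetical helper, and `sys.float_info.epsilon` is used in place of `np.finfo(np.float64).eps` so the snippet needs only the standard library.

```python
import sys

def error_to_score(error, eps = 100 * sys.float_info.epsilon):
    """Map a non-negative error to a reward using the thresholds from `correctness_check`."""
    if error >= 3: return -3.0
    if error >= 2: return -2.5
    if error >= 1: return -2.0
    if error >= 0.5: return -1.0
    if error >= 100 * eps: return 0.0
    if error >= eps: return 1.0
    return 3.0

for error in (0.0, 1e-13, 1e-6, 0.7, 5.0):
    print(f"error = {error:g} -> score = {error_to_score(error)}")
```

# Because the same ladder is applied to both the maximum error and the mean squared error and the two scores are added, the combined correctness reward ranges from -6.0 to 6.0.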
# Finally our benchmarking function for `speed_check`! We shall limit the timer to 10 seconds and do 3 trials.
# In[ ]:
A, A_list, B, B_list = generate_random_matrices(seed = 0, n = 256)
benchmarker = Benchmarker(trials = 3, timeout = 10)
numpy_results = benchmarker.benchmark(np.matmul, [(A, B)])
numpy_results
# In[ ]:
new_matmul = create_locked_down_function(extract_function(prompt))
new_results = benchmarker.benchmark(new_matmul, [(A_list, B_list)])
new_results
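# We take the ratio of the two timings. If the new kernel is slower, the reward is the negative ratio; if it is faster (ratio below 1), we invert the ratio so the reward is positive. Both are divided by 100 to keep the scale small.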
# In[ ]:
negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
reward = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
reward
# In[ ]:
new_results["median_ns"] = 3
numpy_results["median_ns"] = 1000
negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
reward = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
reward
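# The same formula can be wrapped in one function, together with the clipping to [-10, 10] that `speed_check` applies. This is a sketch under our own naming: `speed_reward` and its `clip` parameter are hypothetical, and the timings are made up.

```python
def speed_reward(new_ns, reference_ns, clip = 10.0):
    """Signed, clipped speed score following the formula above."""
    if new_ns >= reference_ns:
        score = -(new_ns / reference_ns) / 100  # slower or equal: small negative score
    else:
        score = +(reference_ns / new_ns) / 100  # faster: positive score
    return max(-clip, min(clip, score))

print(speed_reward(3, 1000))        # roughly 333x faster than the reference
print(speed_reward(2000, 1000))     # 2x slower
print(speed_reward(1, 10_000_000))  # extreme speedup, clipped
```

# Note the asymmetry: a kernel twice as slow only loses 0.02, while large speedups are rewarded much more strongly until the clip takes effect.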
# In[ ]:
import gc
def speed_check(completions, **kwargs):
    scores = []
    # Generate some random matrices of size less than 256
    A, A_list, B, B_list = generate_random_matrices(seed = np.random.randint(10000), n = 256)
    numpy_results = benchmarker.benchmark(np.matmul, [(A, B)])
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_only_stdlib_imports(function)
        if function is None or "error" in info:
            scores.append(0)
            continue
        try:
            new_matmul = create_locked_down_function(function)
        except:
            scores.append(0)
            continue
        new_results = benchmarker.benchmark(new_matmul, [(A_list.copy(), B_list.copy())])
        # Get score and clip to -10, 10
        negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
        positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
        score = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
        if score >= 10: score = 10
        if score <= -10: score = -10
        scores.append(score)
    # Free memory to counteract OOMs
    gc.collect()
    torch.cuda.empty_cache()
    return scores
# We create the dataset which includes a replica of our prompt.
# In[ ]:
from datasets import Dataset
dataset = Dataset.from_list([{"prompt" : [{"role": "user", "content": prompt.strip()}], "answer" : 0}]*1000)
maximum_length = len(tokenizer.apply_chat_template([{"role":"user", "content":prompt.strip()}], add_generation_prompt = True, tokenize = True))
print(maximum_length)
dataset[0]
# <a name="Train"></a>
# ### Train the model
#
# Now set up the GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to our docs https://unsloth.ai/docs/ for more info!
# In[ ]:
# Leave room for the prompt (plus 1 token safety margin)
max_completion_length = max_seq_length - (maximum_length + 1)
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
temperature = 1.0,
top_p = 0.95,
top_k = 64,
learning_rate = 5e-5,
weight_decay = 0.001,
warmup_ratio = 0.1,
lr_scheduler_type = "linear",
optim = "adamw_8bit",
logging_steps = 1,
per_device_train_batch_size = 1,
gradient_accumulation_steps = 2, # Increase to 4 for smoother training
num_generations = 2, # Decrease if out of memory
max_completion_length = max_completion_length,
# num_train_epochs = 1, # Set to 1 for a full training run
max_steps = 100,
save_steps = 100,
report_to = "none", # Can use Weights & Biases, TrackIO
output_dir = "outputs",
epsilon = 0.2,
epsilon_high = 0.28, # one sided
delta = 1.5, # two sided
loss_type = 'bnpo',
mask_truncated_completions = True
# For optional training + evaluation
# fp16_full_eval = True,
# per_device_eval_batch_size = 4,
# eval_accumulation_steps = 1,
# eval_strategy = "steps",
# eval_steps = 1,
)
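# To see what `warmup_ratio = 0.1` and `lr_scheduler_type = "linear"` mean for the learning rate, here is a minimal sketch. It assumes the usual shape of a linear schedule (a linear ramp from 0 during warmup, then a linear decay to 0 at the last step); the helper `linear_schedule_lr` is hypothetical and the trainer's exact rounding of warmup steps may differ.

```python
def linear_schedule_lr(step: int, base_lr: float = 5e-5,
                       total_steps: int = 100, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    # Assumption: the number of warmup steps is the ratio times the total, rounded down.
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 5, 10, 55, 100):
    print(step, linear_schedule_lr(step))
```

# With the values used above, the peak learning rate of `5e-5` is reached at step 10 and then shrinks steadily for the remaining 90 steps.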
# And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!
#
# You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!
#
# | Step | Training Loss | reward | reward_std | completion_length | kl |
# |------|---------------|-----------|------------|-------------------|----------|
# | 1 | 0.000000 | 0.125000 | 0.000000 | 200.000000 | 0.000000 |
# | 2 | 0.000000 | 0.072375 | 0.248112 | 200.000000 | 0.000000 |
# | 3 | 0.000000 | -0.079000 | 0.163776 | 182.500000 | 0.000005 |
# In[ ]:
# For optional training + evaluation
# new_dataset = dataset.train_test_split(test_size = 0.01)
trainer = GRPOTrainer(
model = model,
processing_class = tokenizer,
reward_funcs = [
function_works,
no_cheating,
correctness_check,
speed_check,
],
args = training_args,
train_dataset = dataset,
# For optional training + evaluation
# train_dataset = new_dataset["train"],
# eval_dataset = new_dataset["test"],
)
# And let's train the model!
#
# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
# In[ ]:
trainer.train()
# Now that we have trained the LoRA with GRPO, we first save it!
# In[ ]:
model.save_pretrained("gemma_4_lora") # Local saving
tokenizer.save_pretrained("gemma_4_lora")
# Verify LoRA is actually trained!
# In[ ]:
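from safetensors import safe_open
# Load the adapter from the directory we just saved to ("gemma_4_lora")
with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
    # Verify both A and B are non zero
    for key in f.keys():
        tensor = f.get_tensor(key)
        # Fraction of entries that are exactly zero (1.0 means the tensor is all zeros)
        zero_fraction = (tensor == 0).sum() / tensor.numel()
        assert zero_fraction.item() != 1.0, f"{key} is all zeros"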
# <a name="Inference"></a>
# # Inference
# Now let's try the model we just trained!
# In[ ]:
text = tokenizer.apply_chat_template(
[{"role": "user", "content": prompt.strip()}],
tokenize = False,
add_generation_prompt = True,
)
from transformers import TextStreamer
_ = model.generate(
**tokenizer(images = None, text = text, return_tensors = "pt").to("cuda"),
temperature = 1.0, top_p = 0.95, top_k = 64,
max_new_tokens = 1024,
streamer = TextStreamer(tokenizer, skip_prompt = False),
)
# <a name="Save"></a>
# ### Saving to float16 for vLLM
#
# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
# In[ ]:
# Merge to 16bit
if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
# Merge to 4bit
if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
# Just LoRA adapters
if False:
    model.save_pretrained("gemma_4_lora")
    tokenizer.save_pretrained("gemma_4_lora")
if False:
    model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
    tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
# ### GGUF / llama.cpp Conversion
# To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
#
# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
#
# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# In[ ]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change HF_USERNAME to your username!
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
# Save to 16bit GGUF
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "YOUR_HF_TOKEN",
    )
# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
#
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,913 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# # Goal: Make Gemma 4 play games with Reinforcement Learning
#
# Our goal is to make Gemma 4 play the 2048 game using reinforcement learning, specifically an algorithm called [GRPO](https://arxiv.org/abs/2501.12948).
#
# We want the model to devise a strategy to play 2048, and we will run this strategy until we win or lose. We then reward the model if it created a good strategy (winning the game), and we'll penalize it (negative reward) if the strategy was a bad one.
#
# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/2048_win.png/500px-2048_win.png" height=300 />
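# To make the reward idea concrete: GRPO samples a group of completions for the same prompt and compares each completion's reward against the rest of its group. The sketch below is a simplified illustration of that group-relative advantage, not the trainer's actual code; the helper name, the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# One strategy wins the game (reward 20), the other only runs without errors (reward 2).
print(group_relative_advantages([20.0, 2.0]))
# If every completion earns the same reward, there is nothing to learn from.
print(group_relative_advantages([2.0, 2.0]))
```

# Strategies that do better than their group get a positive advantage and are reinforced, while worse ones get a negative advantage.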
# # Installation
# We'll be using [Unsloth](https://github.com/unslothai/unsloth) to do RL on Gemma 4. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster!
# In[ ]:
get_ipython().run_cell_magic('capture', '', 'import os, importlib.util\n!pip install --upgrade -qqq uv\nif importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):\n try: import numpy, PIL; _numpy = f"numpy=={numpy.__version__}"; _pil = f"pillow=={PIL.__version__}"\n except: _numpy = "numpy"; _pil = "pillow"\n # Gemma 4 requires transformers >= 5.5.0 — do NOT pin to 4.x here\n !uv pip install -qqq \\\n "torch>=2.8.0" "triton>=3.4.0" {_numpy} {_pil} torchvision bitsandbytes \\\n "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \\\n "unsloth[base] @ git+https://github.com/unslothai/unsloth" \\\n git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\nelif importlib.util.find_spec("unsloth") is None:\n !uv pip install -qqq unsloth\n# Gemma 4 requires transformers >= 5.5.0\n!uv pip install --upgrade --no-deps "transformers>=5.5.0" tokenizers "trl>=0.28.0" unsloth unsloth_zoo\n')
# In[ ]:
get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
# ### Unsloth
# In[ ]:
from unsloth import FastVisionModel
import torch
max_seq_length = 4096 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastVisionModel.from_pretrained(
model_name = "unsloth/gemma-4-E2B-it",
max_seq_length = max_seq_length,
load_in_4bit = False, # False for LoRA 16bit
fast_inference = False, # Set to True to enable vLLM fast inference
)
# To do efficient RL, we will use [LoRA](https://arxiv.org/abs/2106.09685), which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy.
# In[ ]:
model = FastVisionModel.get_peft_model(
model,
r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha = lora_rank*2, # *2 speeds up training
use_gradient_checkpointing = "unsloth", # Reduces memory usage
random_state = 3407,
)
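# To get a feel for why LoRA adds so few weights, here is a small worked calculation. It is a sketch under stated assumptions: the helper names are hypothetical, and the 2048 x 2048 layer size is an illustrative number, not Gemma 4's real dimensions.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Extra LoRA parameters relative to the frozen (d_out x d_in) weight matrix."""
    return lora_extra_params(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical square projection layer with the rank used above (lora_rank = 32)
print(lora_extra_params(2048, 2048, 32))
print(f"{lora_fraction(2048, 2048, 32):.2%}")
```

# Doubling the rank doubles the number of extra parameters, which is the trade-off behind the "larger rank = smarter, but slower" comment on `lora_rank`.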
# # 2048 game
#
# We used GPT-5 to create a variant of the 2048 game. It should output the current game board state, and allow us to advance the game board state with 1 action (up, down, left, right).
# In[ ]:
#@title (Collapsible) 2048 Game Implementation
from dataclasses import dataclass, field
from typing import List, Tuple, Optional
import random
import copy
def _compress_and_merge_row_left(row: List[int]) -> Tuple[List[int], int, bool]:
    n = len(row)
    tiles = [x for x in row if x != 0]
    gained = 0
    i = 0
    merged = []
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            v = tiles[i] * 2
            gained += v
            merged.append(v)
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    merged += [0] * (n - len(merged))
    changed = merged != row
    return merged, gained, changed
def _move_left(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
    changed_any = False
    total_gain = 0
    new_board = []
    for row in board:
        new_row, gained, changed = _compress_and_merge_row_left(row)
        new_board.append(new_row)
        total_gain += gained
        changed_any = changed_any or changed
    return new_board, total_gain, changed_any
def _move_right(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
    changed_any = False
    total_gain = 0
    new_board = []
    for row in board:
        rev = list(reversed(row))
        new_rev, gained, changed = _compress_and_merge_row_left(rev)
        new_row = list(reversed(new_rev))
        new_board.append(new_row)
        total_gain += gained
        changed_any = changed_any or changed
    return new_board, total_gain, changed_any
def _transpose(board: List[List[int]]) -> List[List[int]]:
    return [list(row) for row in zip(*board)]
def _move_up(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
    t = _transpose(board)
    moved, gain, changed = _move_left(t)
    return _transpose(moved), gain, changed
def _move_down(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
    t = _transpose(board)
    moved, gain, changed = _move_right(t)
    return _transpose(moved), gain, changed
def _empty_cells(board: List[List[int]]) -> List[Tuple[int, int]]:
    size = len(board)
    return [(r, c) for r in range(size) for c in range(size) if board[r][c] == 0]
def _can_move(board: List[List[int]]) -> bool:
    if _empty_cells(board):
        return True
    size = len(board)
    for r in range(size):
        for c in range(size - 1):
            if board[r][c] == board[r][c + 1]:
                return True
    for r in range(size - 1):
        for c in range(size):
            if board[r][c] == board[r + 1][c]:
                return True
    return False
@dataclass
class GameBoard:
    size: int
    seed: Optional[int] = None
    target: int = 2048
    probability_fours: float = 0.10 # originally spawns (4) 10% of the time!
    _rng: random.Random = field(init = False, repr = False)
    _board: List[List[int]] = field(init = False, repr = False)
    _score: int = field(default = 0, init = False, repr = False)
    _state: str = field(default = "ongoing", init = False, repr = False)
    def __post_init__(self):
        if self.size < 2:
            raise ValueError("Board size must be at least 2.")
        self._rng = random.Random(self.seed)
        self._board = [[0 for _ in range(self.size)] for _ in range(self.size)]
        self._add_random_tile()
        self._add_random_tile()
        self._update_state_after_change()
    class _BoardView:
        def __init__(self, game: "GameBoard"):
            self._game = game
        def __iter__(self):
            return iter(self._game._board)
        def __len__(self):
            return len(self._game._board)
        def __getitem__(self, idx):
            return self._game._board[idx]
        def __repr__(self) -> str:
            return repr(self._game._board)
        __str__ = __repr__
        def do_action(self, key: str) -> None:
            self._game.do_action(key)
        def state(self) -> str:
            return self._game.state()
        def pretty(self, colors: bool = True, border: bool = True, dot_for_zero: bool = True) -> str:
            return self._game._render_pretty(colors = colors, border = border, dot_for_zero = dot_for_zero)
    def board(self) -> "_BoardView":
        return GameBoard._BoardView(self)
    def state(self) -> str:
        return self._state
    def score(self) -> int:
        return self._score
    def do_action(self, key: str) -> None:
        if self._state != "ongoing":
            return
        if not isinstance(key, str) or len(key) == 0:
            self._state = "failed"
            return
        k = key.strip().lower()
        if k == "q":
            self._state = "failed"
            return
        move_map = {"a": _move_left, "d": _move_right, "w": _move_up, "s": _move_down}
        if k not in move_map:
            self._state = "failed"
            return
        mover = move_map[k]
        new_board, gain, changed = mover(self._board)
        if changed:
            self._board = new_board
            self._score += gain
            self._add_random_tile()
        self._update_state_after_change()
    def _add_random_tile(self) -> bool:
        empties = _empty_cells(self._board)
        if not empties:
            return False
        r, c = self._rng.choice(empties)
        self._board[r][c] = 4 if self._rng.random() < self.probability_fours else 2
        return True
    def _update_state_after_change(self) -> None:
        if any(self.target in row for row in self._board):
            self._state = "success"
            return
        if not _can_move(self._board):
            self._state = "failed"
            return
        self._state = "ongoing"
    def _render_pretty(self, colors: bool = True, border: bool = True, dot_for_zero: bool = True) -> str:
        """
        Pretty-print the board with colors that scale from 0 up to self.target.
        Uses ANSI 256-color codes (works in most terminals). Set colors = False to disable.
        """
        import math
        b = self._board
        mx = max((max(row) for row in b), default = 0)
        cell_w = max(3, len(str(mx)))
        RESET = "\x1b[0m"
        # A smooth-ish gradient from cool → warm
        # (blue/cyan/green → yellow/orange/red). Tweak or expand as you like.
        GRAD = [33, 39, 45, 51, 50, 49, 48, 47, 46, 82, 118, 154, 190, 226, 220, 214, 208, 202, 196]
        ZERO_FG = 239 # dim gray
        def color_code(v: int) -> str:
            if not colors:
                return ""
            if v == 0:
                return f"\x1b[38;5;{ZERO_FG}m"
            # Normalize by exponent relative to target: r in [0,1]
            t = max(2, self.target) # safety; avoid log2(1)
            # Guard: if v is not a power of two or is <1, handle gracefully
            try:
                r = max(0.0, min(1.0, math.log2(v) / math.log2(t)))
            except ValueError:
                r = 0.0
            idx = int(round(r * (len(GRAD) - 1)))
            return f"\x1b[38;5;{GRAD[idx]}m"
        def fmt(v: int) -> str:
            s = "." if (v == 0 and dot_for_zero) else str(v)
            s = s.rjust(cell_w)
            return color_code(v) + s + (RESET if colors else "")
        def hline(left: str, mid: str, right: str) -> str:
            return left + mid.join("─" * cell_w for _ in range(self.size)) + right
        rows = []
        if border:
            rows.append(hline("┌", "┬", "┐"))
        for r in range(self.size):
            content = "│".join(fmt(v) for v in b[r])
            rows.append(("│" + content + "│") if border else content)
            if border:
                rows.append(hline("└" if r == self.size - 1 else "├",
                                  "┴" if r == self.size - 1 else "┼",
                                  "┘" if r == self.size - 1 else "┤"))
        return "\n".join(rows)
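# The core rule of the game implementation above is how a single row slides and merges to the left. The following is a minimal, self-contained sketch of that rule; `merge_row_left` is a simplified stand-in written for illustration, not the helper used by `GameBoard`.

```python
from typing import List, Tuple


def merge_row_left(row: List[int]) -> Tuple[List[int], int]:
    """Slide tiles left, merging each pair of equal neighbours once."""
    tiles = [x for x in row if x != 0]  # drop the empty cells
    merged: List[int] = []
    gained = 0
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # two equal tiles become one doubled tile
            gained += tiles[i] * 2       # the score grows by the value of the new tile
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    # Pad with empty cells so the row keeps its length
    return merged + [0] * (len(row) - len(merged)), gained


print(merge_row_left([2, 2, 4, 0]))
print(merge_row_left([2, 2, 2, 2]))
print(merge_row_left([2, 4, 2, 4]))
```

# Moving right is the same rule applied to the reversed row, and moving up or down is the same rule applied to the transposed board.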
# For example let's create a board of size 5 X 5 and set the target to 8 instead of 2048.
#
# **[NOTE]** 2048 originally spawns a (4) 10% of the time! We can disable this for harder games. See [Wikipedia page](https://en.wikipedia.org/wiki/2048_(video_game)) for more details.
# In[ ]:
game = GameBoard(size = 5, seed = 42, target = 8, probability_fours = 0.10)
print(game.board().pretty(), game.state())
# In[ ]:
game
# We'll use WASD for the action space:
#
# ```
#   W
# A S D
# ```
# Also `game.state()` will say `success` if we succeeded in getting the target!
# In[ ]:
game.do_action("A")
print(game.board().pretty(), game.state())
# In[ ]:
game.do_action("W")
print(game.board().pretty(), game.state())
# In[ ]:
game.do_action("D")
print(game.board().pretty(), game.state())
# In[ ]:
game.do_action("W")
print(game.board().pretty(), game.state())
# In[ ]:
game.do_action("D")
print(game.board().pretty(), game.state())
# If we do some other action that's not part of the action space, the game state becomes `failed`, and the game will not accept any more actions.
# In[ ]:
game = GameBoard(size = 3, seed = 42, target = 8, probability_fours = 0.10)
game.do_action("AA") # Not in WASD
game.do_action("W") # Doesn't do anything
game.do_action("A") # Doesn't do anything
print(game.board().pretty(), game.state())
# # RL Environment Setup
#
# We'll set up a function to accept some strategy that'll emit an action within `WASD` and check the game state.
#
# We'll also add a timer to only execute the strategy for 2 seconds maximum, otherwise it might never terminate!
# In[ ]:
from typing import Callable
from unsloth import execute_with_time_limit
def _execute_strategy(strategy : Callable, game : GameBoard):
    assert callable(strategy)
    steps = 0
    while game.state() == "ongoing":
        action = strategy(list(game.board()))
        steps += 1
        if type(action) is not str:
            return steps, "failed"
        game.do_action(action)
    return steps, game.state()
@execute_with_time_limit(2)
def execute_strategy(strategy : Callable, game : GameBoard):
    return _execute_strategy(strategy, game)
# Let's make a generic strategy to just hit `W`. We should expect this generic strategy to fail:
# In[ ]:
def always_move_up(board):
    return "W" # "W" moves up
game = GameBoard(size = 8, seed = 42, target = 2048, probability_fours = 0.10)
try:
    execute_strategy(always_move_up, game)
except TimeoutError as e:
    print(f"Timed out with error = {str(e)}")
# To allow longer strategies for Gemma 4 Reinforcement Learning, we shall allow a 5 second timer.
# In[ ]:
@execute_with_time_limit(5)
def execute_strategy(strategy : Callable, game : GameBoard):
    return _execute_strategy(strategy, game)
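# To show what a time-limit decorator like the one above does conceptually, here is a minimal sketch built on Python's `signal` module. It is not Unsloth's `execute_with_time_limit`: the name `time_limit` is hypothetical, and this approach only works on Unix-like systems and in the main thread.

```python
import signal
from functools import wraps


def time_limit(seconds: float):
    """Raise TimeoutError if the decorated function runs longer than `seconds`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            def on_alarm(signum, frame):
                raise TimeoutError(f"Timed out after {seconds} seconds")
            previous = signal.signal(signal.SIGALRM, on_alarm)
            signal.setitimer(signal.ITIMER_REAL, seconds)  # start the countdown
            try:
                return fn(*args, **kwargs)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)    # cancel the countdown
                signal.signal(signal.SIGALRM, previous)    # restore the old handler
        return wrapper
    return decorator


@time_limit(0.2)
def spin_forever():
    # Stands in for a strategy that never finishes the game
    while True:
        pass


try:
    spin_forever()
except TimeoutError as e:
    print(f"Stopped: {e}")
```

# A strategy that keeps returning a move that no longer changes the board loops forever in the same way, which is why the time limit is needed.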
# # Code Execution
#
# To create and execute a new Python function, we first have to check that the function does not access global variables or otherwise cheat. This is called `countering reward hacking`, since we don't want the function to cheat.
#
# For example, the piece of code below is fine, since it only imports standard-library Python modules. We use `check_python_modules`:
# In[ ]:
from unsloth import check_python_modules
sample = """
def strategy(board):
    import math
    from typing import Callable
    return "W"
"""
ok, info = check_python_modules(sample)
print("Only Python imports?", ok)
print(info)
# For the below piece of code, since we import `numpy`, we should not allow the execution:
# In[ ]:
sample = """
def strategy(board):
    from numpy import matmul
    return "W"
"""
ok, info = check_python_modules(sample)
print("Only Python imports?", ok)
print(info)
# We also disallow global variable access. We'll use Unsloth's `create_locked_down_function` function
# In[ ]:
from unsloth import create_locked_down_function
function = """
def import_numpy():
    np.matmul
    print("Success")
"""
f = create_locked_down_function(function)
try:
    f()
except Exception as e:
    print(str(e))
# In[ ]:
from unsloth import create_locked_down_function
function = """
def add(a, b):
    def adder(a):
        return a + b
    return adder(b) + b
"""
f = create_locked_down_function(function)
try:
    print(f(10, 20))
except Exception as e:
    print(str(e))
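# The idea behind blocking global variable access can be sketched with plain `exec` and a fresh namespace. This is an illustration only: `build_isolated_function` is a hypothetical helper, it is not Unsloth's `create_locked_down_function`, and it is not a security sandbox (builtins remain available to the executed code).

```python
def build_isolated_function(source: str, name: str):
    """Compile `source` in a fresh namespace and return the function called `name`."""
    namespace = {}           # fresh globals: nothing from the caller is visible
    exec(source, namespace)  # Python adds `__builtins__` to the namespace automatically
    return namespace[name]


np = "a global defined by the caller"  # stand-in for an imported module

uses_global = build_isolated_function(
    "def uses_global():\n    return np\n", "uses_global"
)
try:
    uses_global()
except NameError as e:
    print(f"Blocked: {e}")

# Nested helper functions and closures still work inside the isolated function
add = build_isolated_function(
    "def add(a, b):\n"
    "    def adder(a):\n"
    "        return a + b\n"
    "    return adder(b) + b\n",
    "add",
)
print(add(10, 20))
```

# The function's globals are the fresh namespace rather than the notebook's, so names such as `np` defined in the notebook cannot leak into a generated strategy.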
# # Data & RL task setup
#
# We now have to create a prompt to tell the model to create a strategy for the 2048 game. You can customize this to some other task for another RL task.
# In[ ]:
prompt = """
Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "W", "A", "S", "D" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "W" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.
""".strip()
print(prompt)
# First, let's prompt Gemma 4 without RL and see how it goes:
# In[ ]:
text = tokenizer.apply_chat_template(
[{"role": "user", "content": prompt.strip()}],
tokenize = False,
add_generation_prompt = True,
)
from transformers import TextStreamer
print("=" * 50)
print("BASE MODEL OUTPUT (before RL training):")
print("=" * 50)
inputs = tokenizer(
text = text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# # Reward functions
#
# We now design an `extract_function` function which simply extracts the function wrapped in triple backticks.
#
# And 3 reward functions:
#
# 1. `function_works` which rewards the model if the strategy is a valid Python function.
# 2. `no_cheating` which checks if the function imported other modules, and if it did, we penalize it.
# 3. `strategy_succeeds` which checks if the game strategy actually succeeds in attaining 2048 after running the auto-generated strategy.
# In[ ]:
def extract_function(text):
    if text.count("```") >= 2:
        first = text.find("```") + 3
        second = text.find("```", first)
        fx = text[first : second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):]
        if fx.startswith("def strategy(board):"): return fx
    return None
print(extract_function(prompt))
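# The same extraction can also be written with a regular expression. The variant below is a hedged alternative sketch, not a replacement for the function above: `extract_strategy` and `FENCE` are hypothetical names, and the fence is built with `"`" * 3` only so that this example does not contain a literal triple backtick.

```python
import re
from typing import Optional

FENCE = "`" * 3  # a triple backtick

# Match the first fenced block, with or without the "python" language tag
PATTERN = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_strategy(text: str) -> Optional[str]:
    """Return the source of `def strategy(board):` from the first fenced block, if any."""
    match = PATTERN.search(text)
    if match is None:
        return None
    body = match.group(1).strip()
    start = body.find("def ")
    if start == -1:
        return None
    body = body[start:]
    return body if body.startswith("def strategy(board):") else None


reply = "Sure:\n" + FENCE + 'python\ndef strategy(board):\n    return "W"\n' + FENCE + "\nDone."
print(extract_strategy(reply))
print(extract_strategy("No code here."))
```

# Returning `None` for anything that is not a `strategy(board)` function lets the reward functions treat a malformed answer as a failure.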
# Below is our `function_works` reward function which uses Python's `exec` but guarded by not allowing leakage of local and global variables. We can also use `check_python_modules` first to check if there are errors before even executing the function:
# In[ ]:
ok, info = check_python_modules("def a")
ok, info
# In[ ]:
def function_works(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_python_modules(function)
        if function is None or "error" in info:
            score = -2.0
        else:
            try:
                new_strategy = create_locked_down_function(function)
                score = 1.0
            except Exception:
                score = -0.5
        scores.append(score)
    return scores
# `no_cheating` checks if the function cheated since it might have imported Numpy or other functions:
# In[ ]:
def no_cheating(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_python_modules(function)
            scores.append(1.0 if ok else -20.0) # Penalize heavily!
        else:
            scores.append(-1.0) # Failed creating function
    return scores
# Next, `strategy_succeeds` checks if the strategy actually allows the game to terminate. For example, a strategy that simply returns "W" every time never finishes the game, so it fails once it hits the 5 second time limit.
#
# We also add a global `PRINTER` to print out the strategy and board state.
# In[ ]:
import numpy as np
global PRINTER
PRINTER = 0
def strategy_succeeds(completions, **kwargs):
    global PRINTER
    scores = []
    # Generate a random game board with seed
    seed = np.random.randint(10000)
    for completion in completions:
        printed = False
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if PRINTER % 5 == 0:
            printed = True
            print(function)
        PRINTER += 1
        if function is not None:
            ok, info = check_python_modules(function)
        if function is None or "error" in info:
            scores.append(0)
            continue
        try:
            new_strategy = create_locked_down_function(function)
        except Exception:
            scores.append(0)
            continue
        try:
            game = GameBoard(size = 6, seed = seed, target = 2048, probability_fours = 0.10)
            steps, game_state = execute_strategy(new_strategy, game)
            print(f"Steps = {steps} State = {game_state}")
            if printed is False:
                print(function)
            print(game.board().pretty())
            if game_state == "success":
                scores.append(20.0) # Success - massively reward!
            else:
                scores.append(2.0) # Failed but function works!
        except TimeoutError as e:
            print("Timeout")
            scores.append(-1.0) # Failed with timeout
        except Exception as e:
            print(f"Exception = {str(e)}")
            scores.append(-3.0) # Failed
    return scores
# We'll now create the dataset which includes a replica of our prompt.
# In[ ]:
from datasets import Dataset
dataset = Dataset.from_list([{"prompt" : [{"role": "user", "content": prompt.strip()}], "answer" : 0}]*1000)
maximum_length = len(tokenizer.apply_chat_template([{"role":"user", "content":prompt.strip()}], add_generation_prompt = True, tokenize = True))
print(maximum_length)
dataset[0]
# <a name="Train"></a>
# ### Train the model
#
# Now set up the GRPO Trainer and all its configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth [Reinforcement Learning Docs](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) for more options.
# In[ ]:
# Leave room for the prompt (plus 1 token safety margin)
max_completion_length = max_seq_length - (maximum_length + 1)
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
temperature = 1.0,
top_p = 0.95,
top_k = 64,
learning_rate = 5e-5,
weight_decay = 0.001,
warmup_ratio = 0.1,
lr_scheduler_type = "linear",
optim = "adamw_8bit",
logging_steps = 1,
per_device_train_batch_size = 1,
gradient_accumulation_steps = 2, # Increase to 4 for smoother training
num_generations = 2, # Decrease if out of memory
max_completion_length = max_completion_length,
# num_train_epochs = 1, # Set to 1 for a full training run
max_steps = 60,
save_steps = 100,
report_to = "none", # Can use Weights & Biases, TrackIO
output_dir = "outputs",
epsilon = 0.2,
epsilon_high = 0.28, # one sided
delta = 1.5, # two sided
loss_type = 'bnpo',
mask_truncated_completions = True
# For optional training + evaluation
# fp16_full_eval = True,
# per_device_eval_batch_size = 4,
# eval_accumulation_steps = 1,
# eval_strategy = "steps",
# eval_steps = 1,
)
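# The `epsilon`, `epsilon_high` and `delta` options above control how far a single update may move the policy. The sketch below shows one common way to write such a clipped per-token loss. It is my reading of how these options are usually described, not a copy of the trainer's code, and the helper name is hypothetical.

```python
def clipped_token_loss(ratio: float, advantage: float,
                       epsilon: float = 0.2, epsilon_high: float = 0.28,
                       delta: float = 1.5) -> float:
    """Clipped policy-gradient loss for one token (illustrative sketch).

    `ratio` is the new policy's probability for the token divided by the old policy's.
    """
    # Clip the ratio into [1 - epsilon, 1 + epsilon_high]
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon_high)
    # Additionally cap the unclipped ratio from above at delta
    capped = min(ratio, delta)
    # Take the more pessimistic of the two objectives, then negate to get a loss
    return -min(capped * advantage, clipped * advantage)


print(clipped_token_loss(ratio = 1.0, advantage = 1.0))   # no change in the policy
print(clipped_token_loss(ratio = 2.0, advantage = 1.0))   # gain is limited by 1 + epsilon_high
print(clipped_token_loss(ratio = 3.0, advantage = -1.0))  # penalty is limited by delta
```

# Under these assumptions, the upper clip limits how much a token with a positive advantage can be pushed up in one step, while `delta` bounds the loss for a token with a negative advantage whose ratio has grown large.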
# And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!
#
# You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!
#
# | Step | Training Loss | reward | reward_std | completion_length | kl |
# |------|---------------|-----------|------------|-------------------|----------|
# | 1 | 0.000000 | 0.125000 | 0.000000 | 200.000000 | 0.000000 |
# | 2 | 0.000000 | 0.072375 | 0.248112 | 200.000000 | 0.000000 |
# | 3 | 0.000000 | -0.079000 | 0.163776 | 182.500000 | 0.000005 |
# In[ ]:
# For optional training + evaluation
# new_dataset = dataset.train_test_split(test_size = 0.01)
trainer = GRPOTrainer(
model = model,
processing_class = tokenizer,
reward_funcs = [
function_works,
no_cheating,
strategy_succeeds,
],
args = training_args,
train_dataset = dataset,
# For optional training + evaluation
# train_dataset = new_dataset["train"],
# eval_dataset = new_dataset["test"],
)
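# With several reward functions, each completion ends up with one combined reward. The sketch below assumes the scores are combined as a weighted sum with all weights defaulting to 1; that is my understanding of the trainer's default behaviour and should be checked against the TRL documentation. The helper `combine_rewards` is hypothetical.

```python
from typing import List, Optional


def combine_rewards(per_function_scores: List[List[float]],
                    weights: Optional[List[float]] = None) -> List[float]:
    """Sum the scores of all reward functions for each completion."""
    if weights is None:
        weights = [1.0] * len(per_function_scores)  # assumption: equal weights by default
    n_completions = len(per_function_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_function_scores))
        for i in range(n_completions)
    ]


# Two completions scored by function_works, no_cheating and strategy_succeeds:
# the first one runs but does not reach the target, the second is not valid Python.
scores = [
    [1.0, -2.0],  # function_works
    [1.0, -1.0],  # no_cheating
    [2.0, 0.0],   # strategy_succeeds
]
print(combine_rewards(scores))
```

# Because the scores are added up, the large values in `no_cheating` (-20) and `strategy_succeeds` (+20) dominate the smaller formatting rewards.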
# And let's train the model!
#
# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
# In[ ]:
trainer.train()
# Now that we have trained the LoRA with GRPO, we first save it!
# In[ ]:
model.save_pretrained("gemma_4_lora") # Local saving
tokenizer.save_pretrained("gemma_4_lora")
# Verify LoRA is actually trained!
# In[ ]:
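from safetensors import safe_open
# Load the adapter from the directory we just saved to ("gemma_4_lora")
with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
    # Verify both A and B are non zero
    for key in f.keys():
        tensor = f.get_tensor(key)
        # Fraction of entries that are exactly zero (1.0 means the tensor is all zeros)
        zero_fraction = (tensor == 0).sum() / tensor.numel()
        assert zero_fraction.item() != 1.0, f"{key} is all zeros"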
# <a name="Inference"></a>
# # Inference
# Now let's try the model we just trained!
# In[ ]:
text = tokenizer.apply_chat_template(
[{"role": "user", "content": prompt.strip()}],
tokenize = False,
add_generation_prompt = True,
)
from transformers import TextStreamer
_ = model.generate(
**tokenizer(images = None, text = text, return_tensors = "pt").to("cuda"),
temperature = 1.0, top_p = 0.95, top_k = 64,
max_new_tokens = 1024,
streamer = TextStreamer(tokenizer, skip_prompt = False),
)
# <a name="Save"></a>
# ### Saving to float16 for vLLM
#
# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
# In[ ]:
# Merge to 16bit
if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
# Merge to 4bit
if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
# Just LoRA adapters
if False:
    model.save_pretrained("gemma_4_lora")
    tokenizer.save_pretrained("gemma_4_lora")
if False:
    model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
    tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
# ### GGUF / llama.cpp Conversion
# We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All quantization methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
#
# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
#
# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# In[ ]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change HF_USERNAME to your username!
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
# Save to 16bit GGUF
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "YOUR_HF_TOKEN",
    )
# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
#
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,897 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# # Goal: Make Gemma 4 solve Sudoku puzzles with Reinforcement Learning
#
# Our goal is to make Gemma 4 learn to solve Sudoku puzzles using reinforcement learning (GRPO).
# The model will devise a strategy to fill in empty cells, and we'll reward it for correct placements
# and completing valid puzzles.
#
# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/12/Sudoku_Puzzle_by_L2G-20050714_solution_standardized_layout.svg/1280px-Sudoku_Puzzle_by_L2G-20050714_solution_standardized_layout.svg.png" height="300" />
# # Installation
# We'll be using [Unsloth](https://github.com/unslothai/unsloth) to do RL on Gemma 4. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster.
# In[ ]:
get_ipython().run_cell_magic('capture', '', 'import os, importlib.util\n!pip install --upgrade -qqq uv\nif importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):\n try: import numpy, PIL; _numpy = f"numpy=={numpy.__version__}"; _pil = f"pillow=={PIL.__version__}"\n except: _numpy = "numpy"; _pil = "pillow"\n # Gemma 4 requires transformers >= 5.5.0 — do NOT pin to 4.x here\n !uv pip install -qqq \\\n "torch>=2.8.0" "triton>=3.4.0" {_numpy} {_pil} torchvision bitsandbytes \\\n "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \\\n "unsloth[base] @ git+https://github.com/unslothai/unsloth" \\\n git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\nelif importlib.util.find_spec("unsloth") is None:\n !uv pip install -qqq unsloth\n# Gemma 4 requires transformers >= 5.5.0\n!uv pip install --upgrade --no-deps "transformers>=5.5.0" tokenizers "trl>=0.28.0" unsloth unsloth_zoo\n')
# In[ ]:
get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
# ### Unsloth
# In[ ]:
from unsloth import FastVisionModel
import torch
max_seq_length = 4096 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "unsloth/gemma-4-E2B-it",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = False, # Set to True to enable vLLM fast inference
)
# To do efficient RL, we will use [LoRA](https://arxiv.org/abs/2106.09685), which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy.
# In[ ]:
model = FastVisionModel.get_peft_model(
model,
r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha = lora_rank*2, # *2 speeds up training
use_gradient_checkpointing = "unsloth", # Reduces memory usage
random_state = 3407,
)
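# To get a feel for how few weights LoRA adds, here is a rough back-of-the-envelope sketch. The layer shapes are hypothetical placeholders (not Gemma 4's real dimensions), and the fraction is measured against the adapted matrices only, not the whole model. The only standard part is the count `r * (d_in + d_out)` per adapted matrix.
```python
# Rough LoRA parameter count. Shapes are illustrative assumptions,
# NOT the real Gemma 4 dimensions.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a d_out x d_in weight."""
    return r * (d_in + d_out)

def lora_fraction(shapes, r: int) -> float:
    """Extra trainable weights relative to the adapted matrices themselves."""
    base = sum(d_in * d_out for d_in, d_out in shapes)
    extra = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes)
    return extra / base

# Hypothetical per-layer shapes (d_in, d_out) for the seven target modules
hidden, inter = 2048, 8192
example_shapes = [
    (hidden, hidden),  # q_proj
    (hidden, hidden),  # k_proj
    (hidden, hidden),  # v_proj
    (hidden, hidden),  # o_proj
    (hidden, inter),   # gate_proj
    (hidden, inter),   # up_proj
    (inter, hidden),   # down_proj
]
print(f"Extra weights at r=32: {lora_fraction(example_shapes, 32):.2%}")
```
# Doubling `r` doubles the extra weights, which is why larger ranks cost more memory and time.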
# # Sudoku Game Implementation
#
# We use GPT-5 to create a clean Sudoku solver environment. The strategy outputs "row,col,value" to fill cells.
# In[ ]:
#@title Sudoku Game Implementation
from dataclasses import dataclass, field
from typing import List, Tuple, Optional
import random
import copy
def _is_valid_placement(board: List[List[int]], row: int, col: int, num: int) -> bool:
    """Check if placing num at (row, col) is valid."""
    # Check row
    if num in board[row]:
        return False
    # Check column
    if num in [board[r][col] for r in range(9)]:
        return False
    # Check 3x3 box
    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
    for r in range(box_row, box_row + 3):
        for c in range(box_col, box_col + 3):
            if board[r][c] == num:
                return False
    return True
def _solve_sudoku(board: List[List[int]]) -> bool:
    """Solve sudoku using backtracking (for puzzle generation)."""
    for row in range(9):
        for col in range(9):
            if board[row][col] == 0:
                for num in range(1, 10):
                    if _is_valid_placement(board, row, col, num):
                        board[row][col] = num
                        if _solve_sudoku(board):
                            return True
                        board[row][col] = 0
                return False
    return True
def _generate_complete_board(rng: random.Random) -> List[List[int]]:
    """Generate a complete valid Sudoku board."""
    board = [[0 for _ in range(9)] for _ in range(9)]
    # Fill diagonal 3x3 boxes first (they don't affect each other)
    for box in range(3):
        nums = list(range(1, 10))
        rng.shuffle(nums)
        for i in range(3):
            for j in range(3):
                board[box * 3 + i][box * 3 + j] = nums[i * 3 + j]
    # Solve the rest
    _solve_sudoku(board)
    return board
@dataclass
class SudokuGame:
    difficulty: int = 40 # Number of cells to remove (20 = easy, 40 = medium, 50 = hard)
    seed: Optional[int] = None
    _rng: random.Random = field(init = False, repr = False)
    _board: List[List[int]] = field(init = False, repr = False)
    _solution: List[List[int]] = field(init = False, repr = False)
    _initial_board: List[List[int]] = field(init = False, repr = False)
    _moves: int = field(default = 0, init = False, repr = False)
    _state: str = field(default = "ongoing", init = False, repr = False)
    def __post_init__(self):
        self._rng = random.Random(self.seed)
        # Generate complete board
        complete_board = _generate_complete_board(self._rng)
        self._solution = copy.deepcopy(complete_board)
        # Remove cells to create puzzle
        self._board = copy.deepcopy(complete_board)
        cells = [(r, c) for r in range(9) for c in range(9)]
        self._rng.shuffle(cells)
        for r, c in cells[:self.difficulty]:
            self._board[r][c] = 0
        self._initial_board = copy.deepcopy(self._board)
        self._update_state()
    def board(self) -> List[List[int]]:
        """Return current board state."""
        return [row[:] for row in self._board]
    def initial_board(self) -> List[List[int]]:
        """Return initial puzzle state."""
        return [row[:] for row in self._initial_board]
    def state(self) -> str:
        """Return game state: 'ongoing', 'success', or 'failed'."""
        return self._state
    def moves(self) -> int:
        """Return number of moves made."""
        return self._moves
    def place_number(self, row: int, col: int, num: int) -> bool:
        """Place a number on the board. Returns True if valid move."""
        # Validate input
        if not (0 <= row < 9 and 0 <= col < 9):
            self._state = "failed"
            return False
        if not (1 <= num <= 9):
            self._state = "failed"
            return False
        # Can't modify initial cells
        if self._initial_board[row][col] != 0:
            self._state = "failed"
            return False
        # Can't overwrite a cell that was already filled
        if self._board[row][col] != 0:
            self._state = "failed"
            return False
        # Check if placement is valid
        if not _is_valid_placement(self._board, row, col, num):
            self._state = "failed"
            return False
        # Place number
        self._board[row][col] = num
        self._moves += 1
        self._update_state()
        return True
    def _update_state(self) -> None:
        """Update game state based on current board."""
        # Check if puzzle is complete
        if all(self._board[r][c] != 0 for r in range(9) for c in range(9)):
            # Verify solution is correct
            if self._board == self._solution:
                self._state = "success"
            else:
                self._state = "failed"
        else:
            self._state = "ongoing"
    def pretty(self, colors: bool = True) -> str:
        """Pretty print the Sudoku board."""
        RESET = "\x1b[0m"
        INITIAL = "\x1b[38;5;45m" # Cyan for initial numbers
        PLACED = "\x1b[38;5;226m" # Yellow for placed numbers
        EMPTY = "\x1b[38;5;239m" # Gray for empty cells
        lines = []
        lines.append("┌───────┬───────┬───────┐")
        for row in range(9):
            row_str = "│ " # Left border, so rows line up with the frame
            for col in range(9):
                num = self._board[row][col]
                if colors:
                    if num == 0:
                        row_str += f"{EMPTY}.{RESET}"
                    elif self._initial_board[row][col] != 0:
                        row_str += f"{INITIAL}{num}{RESET}"
                    else:
                        row_str += f"{PLACED}{num}{RESET}"
                else:
                    row_str += str(num) if num != 0 else "."
                if col % 3 == 2:
                    row_str += " │ " # Separator after each 3x3 box
                else:
                    row_str += " "
            lines.append(row_str.rstrip())
            if row == 8:
                lines.append("└───────┴───────┴───────┘")
            elif row % 3 == 2:
                lines.append("├───────┼───────┼───────┤")
        return "\n".join(lines)
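# The game above only reports `"success"` when the filled board equals its stored solution. As a complementary sanity check, the sketch below tests a completed grid against the Sudoku rules directly. `is_valid_solution` and the pattern board are illustrative helpers, not part of the notebook.
```python
from typing import List

def is_valid_solution(board: List[List[int]]) -> bool:
    """True if every row, column and 3x3 box holds the digits 1-9 exactly once."""
    digits = set(range(1, 10))
    rows = [set(row) for row in board]
    cols = [{board[r][c] for r in range(9)} for c in range(9)]
    boxes = [
        {board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    return all(group == digits for group in rows + cols + boxes)

# A shifting pattern that yields a valid completed grid
pattern_board = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
print(is_valid_solution(pattern_board))

# Duplicating a digit within one row breaks the rules
broken = [row[:] for row in pattern_board]
broken[0][0] = broken[0][1]
print(is_valid_solution(broken))
```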
# Test the Sudoku environment:
# In[ ]:
# Create an easy puzzle
game = SudokuGame(difficulty = 30, seed = 42)
print("Initial puzzle:")
print(game.pretty())
print(f"\nState: {game.state()}, Moves: {game.moves()}")
# In[ ]:
game
# Try making some moves:
# In[ ]:
# Make a valid move
game.place_number(0, 1, 7)
print("\nAfter placing 7 at row 0, col 1:")
print(game.pretty())
print(f"State: {game.state()}, Moves: {game.moves()}")
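# If we take an action that's not part of the action space (for example an out-of-range cell, a fixed cell, or a digit that breaks a Sudoku rule), `place_number` returns `False` and the game state becomes `"failed"`, which ends the game.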
# # RL Environment Setup
#
# Execute strategies with time limits to prevent infinite loops.
# In[ ]:
from typing import Callable
from unsloth import execute_with_time_limit
def _execute_strategy(strategy: Callable, game: SudokuGame):
    """Execute a strategy function on a Sudoku game."""
    assert callable(strategy)
    max_moves = 100
    valid_moves = 0 # Track successful moves
    while game.state() == "ongoing" and valid_moves < max_moves:
        try:
            board = game.board()
            initial = game.initial_board()
            result = strategy(board, initial)
            # Validate result format
            if not isinstance(result, (tuple, list)) or len(result) != 3:
                # Invalid format = immediate fail, but return valid moves made
                return valid_moves, "failed"
            row, col, num = result
            # Validate types
            if not all(isinstance(x, int) for x in [row, col, num]):
                return valid_moves, "failed"
            # Try to place number
            success = game.place_number(row, col, num)
            if success:
                valid_moves += 1 # Count this valid move
            else:
                # Invalid move = game fails, but return valid_moves made so far
                return valid_moves, "failed"
        except Exception:
            return valid_moves, "failed"
    if valid_moves >= max_moves and game.state() == "ongoing":
        return valid_moves, "failed"
    return valid_moves, game.state()
# To allow longer strategies for Reinforcement Learning, we shall allow a 10 second timer.
# In[ ]:
@execute_with_time_limit(10)
def execute_strategy(strategy: Callable, game: SudokuGame):
    """Execute strategy with 10 second time limit."""
    return _execute_strategy(strategy, game)
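# `execute_with_time_limit` comes from Unsloth and its internals are not shown here. Purely to illustrate the idea, a time limit can be sketched with the standard `signal` module. This is an assumption-laden stand-in (Unix only, main thread only), and `time_limit` is a hypothetical name.
```python
import functools
import signal
import time

def time_limit(seconds: float):
    """Hypothetical decorator: raise TimeoutError if the call runs too long.
    Relies on SIGALRM, so it only works on Unix and in the main thread."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            def on_alarm(signum, frame):
                raise TimeoutError(f"{fn.__name__} exceeded {seconds} seconds")
            previous = signal.signal(signal.SIGALRM, on_alarm)
            signal.setitimer(signal.ITIMER_REAL, seconds)
            try:
                return fn(*args, **kwargs)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # Cancel the timer
                signal.signal(signal.SIGALRM, previous)  # Restore old handler
        return wrapper
    return decorator

@time_limit(0.2)
def slow():
    time.sleep(2)  # Stands in for a strategy stuck in a long loop
    return "finished"

@time_limit(0.2)
def fast():
    return "finished"

try:
    slow()
except TimeoutError as e:
    print(f"Timed out: {e}")
print(fast())
```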
# Test with a simple strategy:
# In[ ]:
def simple_strategy(board, initial):
    """Simple strategy: fill first empty cell with 7."""
    for r in range(9):
        for c in range(9):
            if board[r][c] == 0 and initial[r][c] == 0:
                return (r, c, 7)
    return (0, 0, 7)
game = SudokuGame(difficulty = 30, seed = 42)
try:
    moves, state = execute_strategy(simple_strategy, game)
    print(f"Moves: {moves}, State: {state}")
except TimeoutError as e:
    print(f"Timed out: {e}")
# In[ ]:
print(game.pretty())
# # Code Execution
#
# Before we create and execute a generated Python function, we first have to check that it does not access other global variables or otherwise cheat. This is called countering reward hacking, since we don't want the function to cheat its way to a reward.
#
# For example, the piece of code below is fine, since it imports nothing and uses only built-in Python functions. We check it with `check_python_modules`:
# In[ ]:
from unsloth import check_python_modules, create_locked_down_function
# Test safe code
sample = """
def strategy(board, initial):
    for r in range(9):
        for c in range(9):
            if board[r][c] == 0:
                return (r, c, 1)
    return (0, 0, 1)
"""
ok, info = check_python_modules(sample)
print("Safe Python code?", ok)
print(info)
# For the below piece of code, since we import `numpy`, we should not allow the execution:
# In[ ]:
sample = """
def strategy(board, initial):
    import numpy as np
    return (0, 0, 1)
"""
ok, info = check_python_modules(sample)
print("Safe Python code?", ok)
print(info)
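# How `check_python_modules` works internally is not shown in this notebook. The sketch below is a simplified stand-in that only illustrates the idea of spotting imports statically with the standard `ast` module. `find_imports` is a hypothetical helper and far less thorough than a real sandbox check.
```python
import ast

def find_imports(source: str) -> list:
    """Return the top-level module names imported anywhere in the source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.extend(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.append(node.module.split(".")[0])
    return sorted(set(found))

safe_code = """
def strategy(board, initial):
    return (0, 0, 1)
"""
unsafe_code = """
def strategy(board, initial):
    import numpy as np
    from os import path
    return (0, 0, 1)
"""
print(find_imports(safe_code))
print(find_imports(unsafe_code))
```
# Because the source is only parsed and never executed, the check itself cannot trigger the imports it is looking for.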
# # Data & RL task setup
#
# Create the prompt that instructs the model to generate a Sudoku solving strategy. You can customize this to some other task for another RL task.
# In[ ]:
prompt = """
Create a Sudoku solving strategy using only native Python built-in functions without any import statements.
You are given two lists of lists (9x9 grids):
- board: current state (0 means empty)
- initial: starting puzzle (0 means was empty, numbers are fixed)
Return a tuple (row, col, number) for the next move.
- row: 0-8 (row index)
- col: 0-8 (column index)
- number: 1-9 (digit to place)
Only place numbers in cells that are BOTH empty in initial AND empty in board (initial[row][col] == 0 AND board[row][col] == 0)
Use Sudoku rules: no duplicates in rows, columns, or 3x3 boxes.
Output your function in backticks:
```python
def strategy(board, initial):
    # Your logic here
    return (row, col, number)
```
All helper functions must be inside def strategy. Output only the function.
""".strip()
print(prompt)
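# For reference, a hand-written function that satisfies this prompt might look like the sketch below. It is only a naive baseline (first empty cell, first digit that breaks no rule) and can still run into dead ends on harder puzzles. It is not what the model is expected to output verbatim.
```python
def strategy(board, initial):
    # Helper lives inside the function, as the prompt requires
    def fits(row, col, num):
        if num in board[row]:
            return False
        if any(board[r][col] == num for r in range(9)):
            return False
        br, bc = 3 * (row // 3), 3 * (col // 3)
        return all(
            board[r][c] != num
            for r in range(br, br + 3)
            for c in range(bc, bc + 3)
        )
    for row in range(9):
        for col in range(9):
            if board[row][col] == 0 and initial[row][col] == 0:
                for num in range(1, 10):
                    if fits(row, col, num):
                        return (row, col, num)
    return (0, 0, 1)  # Nothing placeable: return a dummy move

# Quick check on a valid pattern grid with one cell blanked out
full = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
puzzle = [row[:] for row in full]
puzzle[4][4] = 0
print(strategy(puzzle, puzzle))
```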
# First, let's prompt the model without RL and see how it goes:
# In[ ]:
text = tokenizer.apply_chat_template(
[{"role": "user", "content": prompt.strip()}],
tokenize = False,
add_generation_prompt = True,
)
from transformers import TextStreamer
print("=" * 50)
print("BASE MODEL OUTPUT (before RL training):")
print("=" * 50)
inputs = tokenizer(
text = text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# # Reward functions
#
# We now design an `extract_function` helper which simply extracts the function wrapped in triple backticks.
#
# And 3 reward functions:
#
# 1. `function_works` which rewards the model if the strategy is a valid Python function.
# 2. `no_cheating` which checks if the function imported other modules, and if it did, we penalize it.
# 3. `strategy_succeeds` which checks if the auto-generated strategy actually solves the Sudoku puzzle when run, and gives partial credit for valid moves.
# In[ ]:
def extract_function(text):
    """Extract Python function from markdown code blocks."""
    if text.count("```") >= 2:
        first = text.find("```") + 3
        second = text.find("```", first)
        fx = text[first:second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):]
        if fx.startswith("def strategy(board, initial):"):
            return fx
    return None
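# To see what `extract_function` accepts and rejects, here is a small self-contained check. The function body is restated so the snippet runs on its own, and the fence marker is assembled from single backticks only to keep a literal fence out of the example. The three sample completions are made up.
```python
FENCE = "`" * 3  # Triple backtick, assembled to avoid a literal fence here

def extract_function(text):
    """Same logic as above, restated so this snippet is standalone."""
    if text.count(FENCE) >= 2:
        first = text.find(FENCE) + 3
        second = text.find(FENCE, first)
        fx = text[first:second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):]
        if fx.startswith("def strategy(board, initial):"):
            return fx
    return None

good = f"Here you go:\n{FENCE}python\ndef strategy(board, initial):\n    return (0, 0, 1)\n{FENCE}"
wrong_name = f"{FENCE}python\ndef solve(board):\n    return (0, 0, 1)\n{FENCE}"
no_fence = "def strategy(board, initial):\n    return (0, 0, 1)"

print(extract_function(good))        # The bare function source
print(extract_function(wrong_name))  # None: signature does not match
print(extract_function(no_fence))    # None: no fenced block found
```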
# **Reward 1: Function Works**
#
# Checks if the generated code is valid Python and can be executed.
# In[ ]:
def function_works(completions, **kwargs):
    """Reward for generating valid executable Python code."""
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_python_modules(function)
        if function is None or "error" in info:
            score = -2.0 # Invalid function
        else:
            try:
                new_strategy = create_locked_down_function(function)
                score = 1.0 # Valid function
            except Exception:
                score = -1.0 # Function has errors
        scores.append(score)
    return scores
# **Reward 2: No Cheating**
#
# Penalizes functions that import external libraries.
# In[ ]:
def no_cheating(completions, **kwargs):
    """Penalize use of external imports."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        function = extract_function(response)
        if function is not None:
            ok, info = check_python_modules(function)
            scores.append(1.0 if ok else -20.0) # Heavy penalty for cheating
        else:
            scores.append(-1.0) # Failed to create function
    return scores
# **Reward 3: Strategy Succeeds**
#
# Rewards strategies that successfully solve Sudoku puzzles.
# In[ ]:
import numpy as np
PRINTER = 0 # Counts completions so that every 5th one is printed
def strategy_succeeds(completions, **kwargs):
    """Reward valid moves even if strategy eventually fails."""
    global PRINTER
    scores = []
    seed = np.random.randint(10000)
    difficulty = 40
    for completion in completions:
        printed = False
        response = completion[0]["content"]
        function = extract_function(response)
        if PRINTER % 5 == 0:
            printed = True
            print("\n" + "=" * 60)
            print(function)
            print("=" * 60)
        PRINTER += 1
        if function is not None:
            ok, info = check_python_modules(function)
        if function is None or "error" in info:
            scores.append(0)
            continue
        try:
            new_strategy = create_locked_down_function(function)
        except Exception:
            scores.append(0)
            continue
        try:
            game = SudokuGame(difficulty = difficulty, seed = seed)
            valid_moves, game_state = execute_strategy(new_strategy, game)
            # Filling every removed cell with valid moves counts as solved
            if valid_moves == difficulty:
                game_state = "success"
            print(f"\n Valid moves: {valid_moves}, Final state: {game_state}")
            if not printed:
                print("Strategy:")
                print(function[:200] + "..." if len(function) > 200 else function)
            print("\nFinal board:")
            print(game.pretty())
            if game_state == "success":
                scores.append(30.0) # Solved the puzzle!
            elif valid_moves > 0:
                # Reward based on valid moves made before failure
                # Each valid move is worth 0.2 points
                reward = valid_moves * 0.2
                scores.append(reward)
            else:
                scores.append(-2.0) # Failed immediately with no valid moves
        except TimeoutError:
            print("Timeout")
            scores.append(-1.0)
        except Exception as e:
            print(f"Exception: {str(e)[:100]}")
            scores.append(-3.0)
    return scores
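# The scoring rules inside `strategy_succeeds` can be summarized as a small pure function. This is only a restatement for illustration (`shaped_reward` is a hypothetical name) and it ignores the timeout and exception branches. It is handy for checking that partial credit never outweighs a full solve.
```python
def shaped_reward(valid_moves: int, difficulty: int = 40) -> float:
    """Mirror the scoring above for a strategy that ran without timing out."""
    if valid_moves == difficulty:
        return 30.0               # Every empty cell filled: puzzle solved
    if valid_moves > 0:
        return valid_moves * 0.2  # Partial credit per valid move
    return -2.0                   # Failed before making any valid move

for moves in (0, 1, 20, 39, 40):
    print(moves, shaped_reward(moves))

# The best possible partial credit stays far below the solve bonus
assert shaped_reward(39) < shaped_reward(40)
```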
# # Dataset Preparation
#
# Create the training dataset.
# In[ ]:
from datasets import Dataset
dataset = Dataset.from_list([
{
"prompt": [{"role": "user", "content": prompt.strip()}],
"answer": 0,
}
] * 1000)
maximum_length = len(tokenizer.apply_chat_template(
[{"role": "user", "content": prompt.strip()}],
add_generation_prompt = True
))
print(f"Maximum prompt length: {maximum_length}")
print("\nDataset sample:")
print(dataset[0])
# <a name="Train"></a>
# ### Train the model
#
# Now set up GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth [Reinforcement Learning Docs](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) for more options.
# In[ ]:
# Leave room for the prompt (plus 1 token safety margin)
max_completion_length = max_seq_length - (maximum_length + 1)
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
temperature = 1.0,
learning_rate = 5e-5,
weight_decay = 0.001,
warmup_ratio = 0.1,
lr_scheduler_type = "linear",
optim = "adamw_8bit",
logging_steps = 1,
per_device_train_batch_size = 1,
gradient_accumulation_steps = 2, # Increase to 4 for smoother training
num_generations = 2, # Decrease if out of memory
max_completion_length = max_completion_length,
# num_train_epochs = 1, # Set to 1 for a full training run
max_steps = 60,
save_steps = 100,
report_to = "none", # Can use Weights & Biases, TrackIO
output_dir = "outputs",
epsilon = 0.2,
epsilon_high = 0.28, # one sided
delta = 1.5, # two sided
loss_type = 'bnpo',
mask_truncated_completions = True, # Trailing comma keeps the optional arguments below valid if uncommented
# For optional training + evaluation
# fp16_full_eval = True,
# per_device_eval_batch_size = 4,
# eval_accumulation_steps = 1,
# eval_strategy = "steps",
# eval_steps = 1,
)
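# GRPO compares the completions sampled for the same prompt (`num_generations` of them) and turns their rewards into relative advantages. The sketch below shows the usual normalization idea in plain Python. The exact standard deviation and epsilon used by TRL may differ, so treat `group_advantages` as a hypothetical illustration rather than the trainer's code.
```python
from statistics import mean, pstdev

def group_advantages(rewards, eps: float = 1e-4):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two completions for one prompt, as with num_generations = 2
print(group_advantages([1.0, 3.0]))

# Identical rewards carry no learning signal: all advantages are zero
print(group_advantages([2.0, 2.0]))
```
# This is one reason a larger `num_generations` can help when memory allows: more samples per prompt make it likelier that the rewards in a group actually differ.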
# And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!
#
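# You might have to wait 150 to 200 steps for any action. You'll probably get low reward for the first 100 steps. Please be patient! Note that the config above sets `max_steps = 60`, which is only a short demo run, so raise it to see this.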
#
# | Step | Training Loss | reward | reward_std | completion_length | kl |
# |------|---------------|-----------|------------|-------------------|----------|
# | 1 | 0.000000 | 0.125000 | 0.000000 | 200.000000 | 0.000000 |
# | 2 | 0.000000 | 0.072375 | 0.248112 | 200.000000 | 0.000000 |
# | 3 | 0.000000 | -0.079000 | 0.163776 | 182.500000 | 0.000005 |
# In[ ]:
# For optional training + evaluation
# new_dataset = dataset.train_test_split(test_size = 0.01)
trainer = GRPOTrainer(
model = model,
processing_class = tokenizer,
reward_funcs = [
function_works,
no_cheating,
strategy_succeeds,
],
args = training_args,
train_dataset = dataset,
# For optional training + evaluation
# train_dataset = new_dataset["train"],
# eval_dataset = new_dataset["test"],
)
# And let's train the model!
#
# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
# In[ ]:
trainer.train()
# And now with the LoRA we just trained with GRPO, let's save it first!
# In[ ]:
model.save_pretrained("gemma_4_lora") # Local saving
tokenizer.save_pretrained("gemma_4_lora")
# Verify LoRA is actually trained!
# In[ ]:
from safetensors import safe_open
# The path must match the directory passed to save_pretrained above
with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
    # Verify both A and B are non zero
    for key in f.keys():
        tensor = f.get_tensor(key)
        n_zeros = (tensor == 0).sum().item()
        # A trained tensor must not consist entirely of zeros
        assert n_zeros != tensor.numel(), f"{key} is all zeros"
# <a name="Inference"></a>
# # Inference
# Now let's try the model we just trained!
# In[ ]:
text = tokenizer.apply_chat_template(
[{"role": "user", "content": prompt.strip()}],
tokenize = False,
add_generation_prompt = True,
)
from transformers import TextStreamer
_ = model.generate(
**tokenizer(images = None, text = text, return_tensors = "pt").to("cuda"),
temperature = 1.0,
max_new_tokens = 512,
streamer = TextStreamer(tokenizer, skip_prompt = False),
)
# <a name="Save"></a>
# ### Saving to float16 for vLLM
#
# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. You can also save only the `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! Get a personal access token at https://huggingface.co/settings/tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
# In[ ]:
# Merge to 16bit
if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
# Merge to 4bit
if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
# Just LoRA adapters
if False:
    model.save_pretrained("gemma_4_lora")
    tokenizer.save_pretrained("gemma_4_lora")
if False:
    model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
    tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
# ### GGUF / llama.cpp Conversion
# We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All quantization methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
#
# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
#
# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# In[ ]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change HF_USERNAME to your username!
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
# Save to 16bit GGUF
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "YOUR_HF_TOKEN",
    )
# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
#
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,478 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# ### News
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
#
# <table><tr>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
# </tr></table>
#
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
#
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
#
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
#
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
# # ### Installation
#
# # In[1]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
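#
# The escaped install cell above is hard to read. As a minimal sketch of only its version-selection step: the helper name `pick_xformers` is hypothetical, while the version table and the `0.0.34` fallback are copied from the cell. The `pip` calls themselves are left out.

```python
import re

# torch "major.minor" -> xformers pin, copied from the install cell above.
XFORMERS_BY_TORCH = {
    "2.10": "0.0.34",
    "2.9": "0.0.33.post1",
    "2.8": "0.0.32.post2",
}

def pick_xformers(torch_version: str) -> str:
    """Return the pip requirement string for xformers (hypothetical helper)."""
    # Keep only "major.minor", e.g. "2.9.1+cu126" -> "2.9".
    major_minor = re.match(r"\d+\.\d+", torch_version).group(0)
    # Unknown torch versions fall back to the newest pin, as in the cell.
    return "xformers==" + XFORMERS_BY_TORCH.get(major_minor, "0.0.34")

print(pick_xformers("2.9.1+cu126"))
```

# The lookup only matters on Colab; on local and cloud setups the cell installs `unsloth` directly.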
#
#
# # In[2]:
#
#
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
#
#
# # ### Unsloth
#
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
# In[3]:
from unsloth import FastModel
import torch
from huggingface_hub import snapshot_download
fourbit_models = [
# Gemma 4 models
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B-it",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, processor = FastModel.from_pretrained(
model_name = "unsloth/gemma-4-E4B-it",
dtype = None, # None for auto detection
max_seq_length = 8192, # Choose any for long context!
load_in_4bit = True, # 4 bit quantization to reduce memory
full_finetuning = False, # [NEW!] We have full finetuning now!
# token = "YOUR_HF_TOKEN", # HF Token for gated models
)
# # Gemma 4 can process Text, Vision and Audio!
#
# Let's first experience how Gemma 4 can handle multimodal inputs. Gemma 4's recommended sampling settings are `temperature = 1.0, top_p = 0.95, top_k = 64`, but for ASR in this example we use greedy decoding (`do_sample = False`) so the transcription is repeatable.
# In[4]:
from transformers import TextStreamer
# Helper function for inference
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **processor.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        do_sample = False, # Greedy decoding for ASR
        streamer = TextStreamer(processor, skip_prompt = True),
    )
# <h3>Let's Evaluate Gemma 4 Baseline Performance on German Transcription</h3>
# In[5]:
from datasets import load_dataset, Audio
dataset = load_dataset("kadirnar/Emilia-DE-B000000", split = "train")
# Select a single audio sample to reserve for testing.
# This index is chosen from the full dataset before we create the smaller training split.
test_audio = dataset[7546]
dataset = dataset.select(range(3000))
dataset = dataset.cast_column("audio", Audio(sampling_rate = 16000))
# In[6]:
# Alias the IPython player so it does not shadow `datasets.Audio` imported above
from IPython.display import Audio as IPyAudio, display
print(test_audio['text'])
IPyAudio(test_audio['audio']['array'], rate = test_audio['audio']['sampling_rate'])
# And the translation of the audio from German to English is:
#
# > I—I hold myself directly accountable. That much is, of course, clear: namely, that there are political interests involved in trade—in the exchange of goods—and that political influences are at play. The question is: that should not be the alternative.
# In[7]:
messages = [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an assistant that transcribes speech accurately.",
}
],
},
{
"role": "user",
"content": [
{"type": "audio", "audio": test_audio['audio']['array']},
{"type": "text", "text": "Please transcribe this audio."}
]
}
]
do_gemma_4_inference(messages, max_new_tokens = 256)
# <h3>Baseline Model Performance: 32.43% Word Error Rate (WER) for this sample!</h3>
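#
# The notebook does not show how the WER figure was computed. As a minimal sketch, assuming the usual definition (word-level edit distance divided by the number of reference words) and no text normalization, a hypothetical `word_error_rate` helper could look like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)

# One of five reference words is missing from the hypothesis.
print(word_error_rate("ich rechne direkt mich an", "ich rechne direkt an"))
```

# Real evaluations usually lowercase the text and strip punctuation first, so a figure from this sketch may differ from the 32.43% quoted above.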
# # Let's finetune Gemma 4!
#
# You can finetune the vision, text, and audio parts.
# We now add LoRA adapters so we only need to update a small number of parameters!
# In[8]:
model = FastModel.get_peft_model(
model,
finetune_vision_layers = False, # False if not finetuning vision layers
finetune_language_layers = True, # False if not finetuning language layers
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules = True, # False if not finetuning MLP layers
r = 8, # The larger, the higher the accuracy, but might overfit
lora_alpha = 16, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
# Audio layers
"post", "linear_start", "linear_end",
"embedding_projection",
"ffw_layer_1", "ffw_layer_2",
"output_proj",
]
)
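# To make the `r` and `lora_alpha` choices above more concrete, here is a small arithmetic sketch. It assumes the standard LoRA formulation (a rank-`r` update `B @ A` scaled by `lora_alpha / r`, or by `lora_alpha / sqrt(r)` with rank-stabilized LoRA). The 2048 x 2048 layer size is a made-up example, not a Gemma 4 dimension, and both helper names are hypothetical.

```python
import math

def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)

def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# Hypothetical 2048 x 2048 projection with the rank used above.
print(lora_extra_params(2048, 2048, r = 8))
# Scale for r = 8, lora_alpha = 16 as configured above.
print(lora_scale(lora_alpha = 16, r = 8))
```

# With `r = 8` the adapter for such a layer holds 32,768 parameters versus about 4.2 million in the frozen weight, which is why only a small share of the model is trained.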
# <a name="Data"></a>
# ### Data Prep
# We adapt the `kadirnar/Emilia-DE-B000000` dataset for our German ASR task using Gemma 4 multi-modal chat format. Each audio-text pair is structured into a conversation with `system`, `user`, and `assistant` roles. The processor then converts this into the final training format:
#
# ```
# <bos><|turn>system
# You are an assistant that transcribes speech accurately.<turn|>
# <|turn>user
# <|audio|>Please transcribe this audio.<turn|>
# <|turn>model
# Ich, ich rechne direkt mich an.<turn|>
# ```
# In[9]:
def format_intersection_data(samples: dict) -> dict[str, list]:
    """Format a batch of audio/text samples into chat-style `messages` lists."""
    formatted_samples = {"messages": []}
    for idx in range(len(samples["audio"])):
        audio = samples["audio"][idx]["array"]  # Raw waveform
        label = str(samples["text"][idx])       # Reference transcript
        message = [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are an assistant that transcribes speech accurately.",
                    }
                ],
            },
            {
                "role": "user",
                "content": [
                    {"type": "audio", "audio": audio},
                    {"type": "text", "text": "Please transcribe this audio."}
                ]
            },
            {
                "role": "assistant",
                "content":[{"type": "text", "text": label}]
            }
        ]
        formatted_samples["messages"].append(message)
    return formatted_samples
# In[10]:
dataset = dataset.map(format_intersection_data, batched = True, batch_size = 4, num_proc = 4)
# <a name="Train"></a>
# ### Train the model
# Now let's train our model. We do 60 steps to speed things up; for a full run, set `num_train_epochs = 1` and remove the `max_steps = 60` cap.
# In[11]:
# Use UnslothVisionDataCollator which handles audio token alignment correctly
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
processing_class = processor.tokenizer,
data_collator = UnslothVisionDataCollator(model, processor),
args = SFTConfig(
per_device_train_batch_size = 8,
gradient_accumulation_steps = 1,
warmup_ratio = 0.03,
# num_train_epochs = 1, # Use for full training runs
max_steps = 60,
learning_rate = 5e-5,
logging_steps = 1,
save_strategy = "steps",
optim = "adamw_8bit",
weight_decay = 0.001,
lr_scheduler_type = "cosine",
seed = 3407,
output_dir = "outputs",
report_to = "none",
remove_unused_columns = False,
# The below are a must for audio finetuning:
dataset_text_field = "",
dataset_kwargs = {"skip_prepare_dataset": True},
max_length = 8192,
)
)
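# To see how much data the 60-step demo run actually covers, here is a back-of-the-envelope sketch. It assumes a single GPU and uses the values from the config above. `samples_seen` is a hypothetical helper, not part of TRL or Unsloth.

```python
def samples_seen(max_steps: int, per_device_batch: int, grad_accum: int,
                 num_devices: int = 1) -> int:
    """Number of training examples consumed after `max_steps` optimizer steps."""
    return max_steps * per_device_batch * grad_accum * num_devices

dataset_size = 3000  # dataset.select(range(3000)) above
seen = samples_seen(max_steps = 60, per_device_batch = 8, grad_accum = 1)
print(f"{seen} samples, {seen / dataset_size:.0%} of one epoch")
```

# So the demo run sees 480 of the 3000 examples; a full epoch at this batch size would take 375 steps.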
# In[12]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# # Let's train the model!
#
# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`
# In[13]:
trainer_stats = trainer.train()
# In[14]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
# <a name="Inference"></a>
# ### Inference
# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
# In[15]:
messages = [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an assistant that transcribes speech accurately.",
}
],
},
{
"role": "user",
"content": [
{"type": "audio", "audio": test_audio['audio']['array']},
{"type": "text", "text": "Please transcribe this audio."}
]
}
]
do_gemma_4_inference(messages, max_new_tokens = 256)
# <a name="Save"></a>
# ### Saving, loading finetuned models
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
#
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
# In[16]:
model.save_pretrained("gemma_4_lora") # Local saving
processor.save_pretrained("gemma_4_lora")
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# processor.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
# In[17]:
if False:
    from unsloth import FastModel
    model, processor = FastModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )
messages = [{
"role": "user",
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
}]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
from transformers import TextStreamer
_ = model.generate(
**inputs,
max_new_tokens = 128, # Increase for longer outputs!
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
streamer = TextStreamer(processor, skip_prompt = True),
)
# ### Saving to float16 for vLLM
#
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
# In[18]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-4-finetune", processor)
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[19]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", processor,
        token = "YOUR_HF_TOKEN"
    )
# ### GGUF / llama.cpp Conversion
# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
# In[20]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        processor,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
# Likewise, if you want to instead push to GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[21]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        processor,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
# Now, use the exported `.gguf` file in llama.cpp. With the settings above this is the `Q8_0` export; no `Q4_K_M` file is produced because that quantization is not supported yet.
#
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,557 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab L4 instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and how to [save it](#Save).
# # ### Installation
#
# # In[1]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
#
#
# # In[2]:
#
#
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
#
#
# # ### Unsloth
#
# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
# In[3]:
from unsloth import FastModel
import torch
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/gemma-4-E4B-it",
dtype = None, # None for auto detection
max_seq_length = 1024, # Choose any for long context!
load_in_4bit = True, # 4 bit quantization to reduce memory
full_finetuning = False, # [NEW!] We have full finetuning now!
# token = "YOUR_HF_TOKEN", # HF Token for gated models
)
# # Gemma 4 can process Text, Vision and Audio!
#
# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
# In[4]:
from transformers import TextStreamer
# Helper function for inference
def do_gemma_4_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        temperature = 1.0, top_p = 0.95, top_k = 64, # Recommended Gemma 4 settings
        streamer = TextStreamer(tokenizer, skip_prompt = True),
        use_cache = True
    )
# # Gemma 4 can see images!
#
# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
# In[5]:
sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
messages = [{
"role" : "user",
"content": [
{ "type": "image", "image" : sloth_link },
{ "type": "text", "text" : "Which films does this animal feature in?" }
]
}]
# You might have to wait 1 minute for Unsloth's auto compiler
do_gemma_4_inference(messages, max_new_tokens = 256)
# Let's make a poem about sloths!
# In[6]:
messages = [{
"role": "user",
"content": [{ "type" : "text",
"text" : "Write a poem about sloths." }]
}]
do_gemma_4_inference(messages)
# # Gemma 4 can also hear!
# In[7]:
from IPython.display import Audio, display
Audio("https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3")
# In[8]:
get_ipython().system('wget -qqq https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3 -O audio.mp3')
# In[9]:
audio_file = "audio.mp3"
messages = [{
"role" : "user",
"content": [
{ "type": "audio", "audio" : audio_file },
{ "type": "text", "text" : "What is this audio about?" }
]
}]
do_gemma_4_inference(messages, max_new_tokens = 256)
# # Let's combine all 3 modalities together!
# In[10]:
messages = [{
"role" : "user",
"content": [
{ "type": "audio", "audio" : audio_file },
{ "type": "image", "image" : sloth_link },
{ "type": "text", "text" : "What is this audio and image about? "\
"How are they related?" }
]
}]
do_gemma_4_inference(messages, max_new_tokens = 256)
# # Let's finetune Gemma 4!
#
# For now you can select whether to finetune the vision and text parts. The audio part can also be finetuned, and we're working to make it selectable as well!
# We now add LoRA adapters so we only need to update a small number of parameters!
# In[11]:
model = FastModel.get_peft_model(
model,
finetune_vision_layers = False, # Turn off for just text!
finetune_language_layers = True, # Should leave on!
finetune_attention_modules = True, # Attention good for GRPO
finetune_mlp_modules = True, # Should leave on always!
r = 8, # Larger = higher accuracy, but might overfit
lora_alpha = 8, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
)
# <a name="Data"></a>
# ### Data Prep
# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
#
# ```
# <bos><|turn>user
# Hello<turn|>
# <|turn>model
# Hey there!<turn|>
# ```
# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
# In[12]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "gemma-4",
)
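# As a rough illustration of the turn layout shown above, here is a text-only sketch. The real rendering is done by the tokenizer's chat template. `render_turns` is a hypothetical helper, and the `assistant` to `model` role mapping and the trailing newline are assumptions read off the example.

```python
# The example layout labels the assistant's turn as "model".
ROLE_NAMES = {"assistant": "model"}

def render_turns(messages: list[dict]) -> str:
    """Render plain-text messages in the `<|turn>role ... <turn|>` layout."""
    parts = ["<bos>"]
    for message in messages:
        role = ROLE_NAMES.get(message["role"], message["role"])
        parts.append(f"<|turn>{role}\n{message['content']}<turn|>\n")
    return "".join(parts)

print(render_turns([
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]))
```

# Use `tokenizer.apply_chat_template` for anything real; the sketch ignores images, audio, system prompts and generation prompts.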
# We get the first 3000 rows of the dataset
# In[13]:
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
# In[14]:
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)
# Let's see what row 100 looks like!
# In[15]:
dataset[100]
# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The Processor will add this token before training and the model expects only one.
# In[16]:
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    # Render each conversation to text and drop the leading <bos>
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
    return { "text" : texts, }
dataset = dataset.map(formatting_prompts_func, batched = True)
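# A tiny standalone check of what `removeprefix('<bos>')` does, using a made-up rendered string rather than real tokenizer output. `strip_leading_bos` is a hypothetical helper, and `str.removeprefix` needs Python 3.9 or newer.

```python
def strip_leading_bos(text: str, bos: str = "<bos>") -> str:
    """Drop one leading BOS marker so the tokenizer can add its own later."""
    return text.removeprefix(bos)

rendered = "<bos><|turn>user\nHello<turn|>\n"
print(repr(strip_leading_bos(rendered)))

# Only a leading match is removed; a "<bos>" later in the text is untouched.
print(repr(strip_leading_bos("Hello <bos>")))
```

# Without this step each training example would end up with two `<bos>` tokens once the tokenizer prepends its own.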
# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
# In[17]:
dataset[100]["text"]
# <a name="Train"></a>
# ### Train the model
# Now let's train our model. We do 60 steps to speed things up; for a full run, set `num_train_epochs = 1` and remove the `max_steps = 60` cap.
# In[18]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model = model,
tokenizer = tokenizer,
train_dataset = dataset,
eval_dataset = None, # Can set up evaluation!
args = SFTConfig(
dataset_text_field = "text",
per_device_train_batch_size = 1,
gradient_accumulation_steps = 4, # Use GA to mimic batch size!
warmup_steps = 5,
# num_train_epochs = 1, # Set this for 1 full training run.
max_steps = 60,
learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.001,
lr_scheduler_type = "linear",
seed = 3407,
report_to = "none", # Use TrackIO/WandB etc
),
)
# We also use Unsloth's `train_on_responses_only` function to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!
# In[19]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
trainer,
instruction_part = "<|turn>user\n",
response_part = "<|turn>model\n",
)
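# Conceptually, response-only training sets the label of every token outside a model turn to `-100`, the value the loss ignores. The sketch below illustrates that idea on made-up token ids. It is not Unsloth's implementation, and `mask_non_responses` and the marker ids are hypothetical.

```python
IGNORE_INDEX = -100

def find_sub(seq: list[int], sub: list[int], start: int = 0) -> int:
    """Index of the first occurrence of `sub` in `seq` at or after `start`, else -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_non_responses(input_ids, instruction_marker, response_marker):
    """Copy labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        marker_at = find_sub(input_ids, response_marker, pos)
        if marker_at == -1:
            break
        start = marker_at + len(response_marker)
        # The response runs until the next instruction marker or the end.
        next_instruction = find_sub(input_ids, instruction_marker, start)
        end = len(input_ids) if next_instruction == -1 else next_instruction
        labels[start:end] = input_ids[start:end]
        pos = end
    return labels

USER, MODEL = [10, 11], [20, 21]  # stand-ins for "<|turn>user\n" and "<|turn>model\n"
ids = [1] + USER + [5, 6] + MODEL + [7, 8] + USER + [9] + MODEL + [3]
print(mask_non_responses(ids, USER, MODEL))
```

# Only the tokens after each model marker keep their labels, so the loss is computed on the answers alone.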
# Let's verify masking the instruction part is done! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
# In[20]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])
# Now let's print the masked out example - you should see only the answer is present:
# In[21]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
# In[22]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# # Let's train the model!
#
# To resume a training run, call `trainer.train(resume_from_checkpoint = True)`
# In[23]:
trainer_stats = trainer.train()
# In[24]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
# <a name="Inference"></a>
# ### Inference
# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
# In[25]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "gemma-4",
)
messages = [{
"role": "user",
"content": [{
"type" : "text",
"text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens = 64, # Increase for longer outputs!
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)
# You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
# In[26]:
messages = [{
"role": "user",
"content": [{"type" : "text", "text" : "Why is the sky blue?",}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
from transformers import TextStreamer
_ = model.generate(
**inputs,
max_new_tokens = 64, # Increase for longer outputs!
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
streamer = TextStreamer(tokenizer, skip_prompt = True),
)
# <a name="Save"></a>
# ### Saving, loading finetuned models
# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
#
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
# In[27]:
model.save_pretrained("gemma_4_lora") # Local saving
tokenizer.save_pretrained("gemma_4_lora")
# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
# In[28]:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )
messages = [{
"role": "user",
"content": [{"type" : "text", "text" : "What is Gemma-4?",}]
}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt = True, # Must add for generation
return_tensors = "pt",
tokenize = True,
return_dict = True,
).to("cuda")
from transformers import TextStreamer
_ = model.generate(
**inputs,
max_new_tokens = 128, # Increase for longer outputs!
# Recommended Gemma-4 settings!
temperature = 1.0, top_p = 0.95, top_k = 64,
streamer = TextStreamer(tokenizer, skip_prompt = True),
)
# ### Saving to float16 for vLLM
#
# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
# In[29]:
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[30]:
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
        token = "YOUR_HF_TOKEN"
    )
# ### GGUF / llama.cpp Conversion
# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
# In[31]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
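# Since only `Q8_0`, `BF16` and `F16` are listed as supported here, it can help to validate the chosen method before starting a long export. The sketch below is a hypothetical guard, not part of Unsloth: `check_gguf_method` and `SUPPORTED_GGUF_METHODS` are illustrative names, and the supported set is copied from the comment in the cell above, so it may be out of date for newer releases.

```python
# Hypothetical pre-flight check; the supported set mirrors the notebook comment
# ("For now only Q8_0, BF16, F16 supported") and is an assumption, not an API.
SUPPORTED_GGUF_METHODS = {"Q8_0", "BF16", "F16"}

def check_gguf_method(quantization_method: str) -> str:
    """Return the normalized method name, or raise if it is not in the listed set."""
    method = quantization_method.upper()
    if method not in SUPPORTED_GGUF_METHODS:
        supported = ", ".join(sorted(SUPPORTED_GGUF_METHODS))
        raise ValueError(
            f"Unsupported GGUF method {quantization_method!r}; expected one of: {supported}"
        )
    return method

chosen_method = check_gguf_method("q8_0")
print(chosen_method)
```

# The returned value could then be passed as `quantization_method` to `save_pretrained_gguf`.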
# Likewise, if you want to instead push to GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
# In[32]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "HF_ACCOUNT/gemma_4_finetune",
        tokenizer,
        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
        token = "YOUR_HF_TOKEN",
    )
# Now, use the exported GGUF file from the `gemma_4_finetune` folder (for example, the `Q8_0` export) in llama.cpp. `Q4_K_M` files are not produced yet, as noted above.
#
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
#!/usr/bin/env python
# coding: utf-8
# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab L4 instance!
# <div class="align-center">
# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
# </div>
#
# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
#
# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
# ### News
# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
#
# <table><tr>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
# </tr></table>
#
# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
#
# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
#
# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
#
# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
# # ### Installation
#
# # In[1]:
#
#
# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth # Do this in local & cloud setups\nelse:\n import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
#
#
# # In[2]:
#
#
# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
#
#
# # ### Unsloth
# In[3]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch
gemma4_models = [
# Gemma-4 instruct models:
"unsloth/gemma-4-E2B-it",
"unsloth/gemma-4-E4B-it",
"unsloth/gemma-4-31B-it",
"unsloth/gemma-4-26B-A4B-it",
# Gemma-4 base models:
"unsloth/gemma-4-E2B",
"unsloth/gemma-4-E4B",
"unsloth/gemma-4-31B",
"unsloth/gemma-4-26B-A4B",
] # More models at https://huggingface.co/unsloth
model, processor = FastVisionModel.from_pretrained(
"unsloth/gemma-4-E4B-it",
load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)
# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
#
# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
# In[4]:
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers = True, # False if not finetuning vision layers
finetune_language_layers = True, # False if not finetuning language layers
finetune_attention_modules = True, # False if not finetuning attention layers
finetune_mlp_modules = True, # False if not finetuning MLP layers
r = 32, # The larger, the higher the accuracy, but might overfit
lora_alpha = 32, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
target_modules = "all-linear", # Optional now! Can specify a list if needed
)
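# The adapter size and scaling implied by `r` and `lora_alpha` can be estimated by hand. The sketch below is a rough illustration, not Unsloth's implementation: `lora_param_count` and `lora_scaling` are hypothetical helpers, and the 2048 x 2048 layer shape is an assumed example rather than a real Gemma 4 dimension. It relies only on the standard LoRA formulation, where a frozen `d_out x d_in` weight gains two trainable factors of shapes `r x d_in` and `d_out x r`, scaled by `lora_alpha / r` (or `lora_alpha / sqrt(r)` with rank-stabilized LoRA).

```python
import math

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# Assumed example layer shape; real Gemma 4 projection sizes differ per module.
d_in = d_out = 2048
r, lora_alpha = 32, 32

adapter_params = lora_param_count(d_in, d_out, r)
full_params = d_in * d_out
fraction = adapter_params / full_params

print(f"adapter params: {adapter_params}, fraction of layer: {fraction:.4%}")
print(f"scaling: {lora_scaling(lora_alpha, r)}, rsLoRA scaling: {lora_scaling(lora_alpha, r, True):.3f}")
```

# With `lora_alpha == r` the plain scaling factor is 1, which is why the cell above recommends alpha of at least `r`.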
# <a name="Data"></a>
# ### Data Prep
# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
#
# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
# In[5]:
from datasets import load_dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
# Let's take an overview of the dataset. We'll examine the third image (index 2) and its corresponding caption.
# In[6]:
dataset
# In[7]:
dataset[2]["image"]
# In[8]:
dataset[2]["text"]
# We can also render LaTeX directly in the browser!
# In[9]:
from IPython.display import display, Math, Latex
latex = dataset[3]["text"]
display(Math(latex))
# To format the dataset, all vision fine-tuning tasks should follow this format:
#
# ```python
# [
#     {
#         "role": "user",
#         "content": [
#             {"type": "text", "text": instruction},
#             {"type": "image", "image": sample["image"]},
#         ],
#     },
#     {
#         "role": "assistant",
#         "content": [
#             {"type": "text", "text": sample["text"]},
#         ],
#     },
# ]
# ```
# In[10]:
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
    ]
    return {"messages": conversation}
pass
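# To see the shape of one converted record without downloading the dataset, the conversion above can be tried on a stand-in sample. This is only a sketch: `to_conversation` repeats the same logic under a different name, and `fake_sample` is a hypothetical record whose `image` is a placeholder string rather than a real PIL image.

```python
instruction = "Write the LaTeX representation for this image."

def to_conversation(sample: dict) -> dict:
    """Build a two-turn chat: the user sends text plus image, the assistant replies with LaTeX."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {"type": "image", "image": sample["image"]},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["text"]}],
            },
        ]
    }

# Hypothetical record: a placeholder stands in for the PIL image from the dataset.
fake_sample = {"image": "<PIL.Image placeholder>", "text": r"\frac{a}{b}"}
example = to_conversation(fake_sample)
roles = [turn["role"] for turn in example["messages"]]
print(roles)
```

# Each record carries exactly one user turn and one assistant turn, with the target LaTeX only in the assistant turn.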
# Let's convert the dataset into the "correct" format for finetuning:
# In[11]:
converted_dataset = [convert_to_conversation(sample) for sample in dataset]
# The first example is now structured like below:
# In[12]:
converted_dataset[0]
# Let's take the Gemma 4 instruction chat template and use it in our base model.
# In[13]:
from unsloth import get_chat_template
processor = get_chat_template(
processor,
"gemma-4"
)
# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
# In[14]:
image = dataset[2]["image"]
instruction = "Write the LaTeX representation for this image."
messages = [
{
"role": "user",
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# You can see it's absolutely terrible! It doesn't follow instructions at all
# <a name="Train"></a>
# ### Train the model
# Now let's train our model. We run only 60 steps to keep the demo fast; for a full run, set `num_train_epochs=1` and remove the `max_steps` limit. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
#
# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
# In[15]:
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    train_dataset = converted_dataset,
    processing_class = processor.tokenizer,
    data_collator = UnslothVisionDataCollator(model, processor),
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        max_grad_norm = 0.3,
        warmup_ratio = 0.03,
        max_steps = 60,
        # num_train_epochs = 2, # Set this instead of max_steps for full training runs
        learning_rate = 2e-4,
        logging_steps = 1,
        save_strategy = "steps",
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # For Weights and Biases or others
        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        max_length = 2048,
    )
)
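# The batch settings above determine how much data the 60-step demo run actually sees. The arithmetic below is a back-of-the-envelope sketch assuming a single GPU; `effective_batch_size` and `approx_steps_per_epoch` are hypothetical helpers, and the 1,000-sample dataset size is an assumed figure used only to illustrate the epoch estimate.

```python
import math

def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices

def approx_steps_per_epoch(num_samples: int, batch: int) -> int:
    """Rough number of optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / batch)

batch = effective_batch_size(per_device_batch = 1, grad_accum_steps = 4)
samples_seen = batch * 60  # max_steps = 60 in the config above

# Assumed dataset size, for illustration only.
steps_full_epoch = approx_steps_per_epoch(1000, batch)

print(f"effective batch size: {batch}")
print(f"samples seen in 60 steps: {samples_seen}")
print(f"approx. steps for one epoch over 1000 samples: {steps_full_epoch}")
```

# Comparing `samples_seen` with the real dataset length gives a quick sense of how far the demo run is from a full epoch.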
# In[16]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
# In[17]:
trainer_stats = trainer.train()
# In[18]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
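# The same bookkeeping can be packaged as a small function and checked without a GPU. This is a sketch under stated assumptions: `memory_report` is a hypothetical helper, and the byte counts passed in are made-up values standing in for `torch.cuda.max_memory_reserved()` and `gpu_stats.total_memory`. Note that dividing by 1024 three times yields GiB, even though the printed labels above say GB.

```python
GIB = 1024 ** 3

def memory_report(start_bytes: int, peak_bytes: int, total_bytes: int) -> dict:
    """Mirror the notebook's arithmetic: peak usage, the share added by training, and percentages."""
    start = round(start_bytes / GIB, 3)
    peak = round(peak_bytes / GIB, 3)
    total = round(total_bytes / GIB, 3)
    training = round(peak - start, 3)
    return {
        "peak_gib": peak,
        "training_gib": training,
        "peak_pct": round(peak / total * 100, 3),
        "training_pct": round(training / total * 100, 3),
    }

# Made-up figures: 6 GiB reserved after loading, 15 GiB peak, 24 GiB card.
report = memory_report(6 * GIB, 15 * GIB, 24 * GIB)
print(report)
```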
# <a name="Inference"></a>
# ### Inference
# Let's run the model! You can modify the instruction and input—just leave the output blank.
#
# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
# In[19]:
image = dataset[10]["image"]
instruction = "Write the LaTeX representation for this image."
messages = [
{
"role": "user",
"content": [{"type": "image"}, {"type": "text", "text": instruction}],
}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor, skip_prompt = True)
result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
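# For intuition about what `temperature`, `top_k` and `top_p` do to the next-token distribution, here is a simplified pure-Python sketch. It is not the `transformers` implementation: `filter_distribution` is a hypothetical helper, the four logits are toy values, and the ordering (temperature, then top-k, then top-p on the renormalized survivors) is an assumption made for illustration.

```python
import math

def filter_distribution(logits, temperature = 1.0, top_k = 64, top_p = 0.95):
    """Return {token_index: probability} for tokens that survive the filters."""
    # Temperature rescales logits before the softmax (1.0 leaves them unchanged).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # Top-k: keep only the k highest-weight tokens, then renormalize.
    ranked = sorted(range(len(weights)), key = lambda i: weights[i], reverse = True)[:top_k]
    mass = sum(weights[i] for i in ranked)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        p = weights[i] / mass
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    kept_mass = sum(p for _, p in kept)
    return {i: p / kept_mass for i, p in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]
survivors = filter_distribution(toy_logits, temperature = 1.0, top_k = 3, top_p = 0.8)
print(survivors)
```

# Sampling then draws from the surviving tokens only, so lower `top_k` or `top_p` values make the output more conservative.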
# <a name="Save"></a>
# ### Saving, loading finetuned models
# To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for online saving, or `save_pretrained` for local storage.
#
# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
# In[20]:
model.save_pretrained("gemma_4_lora") # Local saving
processor.save_pretrained("gemma_4_lora")
# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
# In[21]:
if False:
    from unsloth import FastVisionModel
    model, processor = FastVisionModel.from_pretrained(
        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = True, # Set to False for 16bit LoRA
    )
sample = dataset[1]
image = sample["image"].convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": instruction, # Prompt with the training instruction, not the target LaTeX
            },
            {
                "type": "image",
            },
        ],
    },
]
input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
inputs = processor(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
# ### Saving to float16 for vLLM
#
# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
# In[22]:
# Select ONLY 1 to save! (Both not needed!)
# Save locally to 16bit
if False: model.save_pretrained_merged("unsloth_finetune", processor,)
# To export and save to your Hugging Face account
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
#
# Some other resources:
# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
#
# <div class="align-center">
# <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
#
# Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
# </div>
#
# This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).