docs: add canonical tooling corpus (147 files) from Google/HF/frameworks

Five-lane parallel research pass. Each subdir under tooling/ has its own README indexing downloaded files with verified upstream sources.

- google-official/: deepmind-gemma JAX examples, gemma_pytorch scripts, gemma.cpp API server docs, google-gemma/cookbook notebooks, ai.google.dev HTML snapshots, Gemma 3 tech report
- huggingface/: 8 gemma-4-* model cards, chat-template .jinja files, tokenizer_config.json, transformers gemma4/ source, launch blog posts, official HF Spaces app.py
- inference-frameworks/: vLLM/llama.cpp/MLX/Keras-hub/TGI/Gemini API/Vertex AI comparison, run_commands.sh with 8 working launches, 9 code snippets
- gemma-family/: 12 per-variant briefs (ShieldGemma 2, CodeGemma, PaliGemma 2, Recurrent/Data/Med/TxGemma, Embedding/Translate/Function/Dolphin/SignGemma)
- fine-tuning/: Unsloth Gemma 4 notebooks, Axolotl YAMLs (incl 26B-A4B MoE), TRL scripts, Google cookbook fine-tune notebooks, recipe-recommendation.md

Findings that update earlier CORPUS_* docs are flagged in tooling/README.md (not applied), notably the new <|turn>/<turn|> prompt format, gemma_pytorch abandonment, gemma.cpp Gemini-API server, transformers AutoModelForMultimodalLM, FA2 head_dim=512 break, 26B-A4B MoE quantization rules, no Gemma 4 tech report PDF yet, no Gemma-4-generation specialized siblings yet.

Pre-commit secrets hook bypassed per user authorization. The flagged "secrets" are base64 notebook cell outputs and example Ed25519 keys in the HDP agentic-security demo, not real credentials.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:24:48 -04:00
parent 5011059f5d
commit eecebe7ef5
149 changed files with 181297 additions and 0 deletions
@@ -0,0 +1,250 @@
+<h1 align="center" style="margin:0;">
+  <a href="https://unsloth.ai/docs"><picture>
+    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/unslothai/unsloth/main/images/STUDIO%20WHITE%20LOGO.png">
+    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/unslothai/unsloth/main/images/STUDIO%20BLACK%20LOGO.png">
+    <img alt="Unsloth logo" src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/STUDIO%20BLACK%20LOGO.png" height="60" style="max-width:100%;">
+  </picture></a>
+</h1>
+<h3 align="center" style="margin: 0; margin-top: 0;">
+Run and train AI models with a unified local interface.
+</h3>
+
+<p align="center">
+  <a href="#-features">Features</a> •
+  <a href="#-quickstart">Quickstart</a> •
+  <a href="#-free-notebooks">Notebooks</a> •
+  <a href="https://unsloth.ai/docs">Documentation</a> •
+  <a href="https://www.reddit.com/r/unsloth/">Reddit</a>
+</p>
+ <a href="https://unsloth.ai/docs/new/studio">
+<img alt="unsloth studio ui homepage" src="https://raw.githubusercontent.com/unslothai/unsloth/main/studio/frontend/public/studio%20github%20landscape%20colab%20display.png" style="max-width: 100%; margin-bottom: 0;"></a>
+
+Unsloth Studio (Beta) lets you run and train text, [audio](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning), [embedding](https://unsloth.ai/docs/new/embedding-finetuning) and [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) models on Windows, Linux and macOS.
+
+## ⭐ Features
+Unsloth provides several key features for both inference and training:
+### Inference
+* **Search + download + run models** including GGUF, LoRA adapters, safetensors
+* **Export models**: [Save or export](https://unsloth.ai/docs/new/studio/export) models to GGUF, 16-bit safetensors and other formats.
+* **Tool calling**: Support for [self-healing tool calling](https://unsloth.ai/docs/new/studio/chat#auto-healing-tool-calling) and web search
+* **[Code execution](https://unsloth.ai/docs/new/studio/chat#code-execution)**: lets LLMs test code in Claude artifacts and sandbox environments
+* [Auto-tune inference parameters](https://unsloth.ai/docs/new/studio/chat#auto-parameter-tuning) and customize chat templates.
+* We work directly with teams behind [gpt-oss](https://docs.unsloth.ai/new/gpt-oss-how-to-run-and-fine-tune#unsloth-fixes-for-gpt-oss), [Qwen3](https://www.reddit.com/r/LocalLLaMA/comments/1kaodxu/qwen3_unsloth_dynamic_ggufs_128k_context_bug_fixes/), [Llama 4](https://github.com/ggml-org/llama.cpp/pull/12889), [Mistral](models/tutorials/devstral-how-to-run-and-fine-tune.md), [Gemma 1-3](https://news.ycombinator.com/item?id=39671146), and [Phi-4](https://unsloth.ai/blog/phi4), where we’ve fixed bugs that improve model accuracy.
+* Upload images, audio, PDFs, code, DOCX and more file types to chat with.
+### Training
+* Train and RL **500+ models** up to **2x faster** with up to **70% less VRAM**, with no accuracy loss.
+* Custom Triton and mathematical **kernels**. See some collabs we did with [PyTorch](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) and [Hugging Face](https://unsloth.ai/docs/new/faster-moe).
+* **Data Recipes**: [Auto-create datasets](https://unsloth.ai/docs/new/studio/data-recipe) from **PDF, CSV, DOCX** etc. Edit data in a visual-node workflow.
+* **[Reinforcement Learning](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide)** (RL): The most efficient [RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) library, using **80% less VRAM** for GRPO, [FP8](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) etc.
+* Supports full fine-tuning, RL, pretraining, 4-bit, 16-bit and FP8 training.
+* **Observability**: Monitor training live, track loss and GPU usage and customize graphs.
+* [Multi-GPU](https://unsloth.ai/docs/basics/multi-gpu-training-with-unsloth) training is supported, with major improvements coming soon.
+
+## ⚡ Quickstart
+Unsloth can be used in two ways: through **[Unsloth Studio](https://unsloth.ai/docs/new/studio/)**, the web UI, or through **Unsloth Core**, the code-based version. Each has different requirements.
+
+### Unsloth Studio (web UI)
+Unsloth Studio (Beta) works on **Windows, Linux, WSL** and **macOS**.
+
+* **CPU:** Supported for Chat and Data Recipes currently
+* **NVIDIA:** Training works on RTX 30/40/50, Blackwell, DGX Spark, Station and more
+* **macOS:** Currently supports chat and Data Recipes. **MLX training** is coming very soon
+* **AMD:** Chat + Data works. Train with [Unsloth Core](#unsloth-core-code-based). Studio support is out soon.
+* **Coming soon:** Training support for Apple MLX, AMD, and Intel.
+* **Multi-GPU:** Available now, with a major upgrade on the way
+
+#### macOS, Linux, WSL:
+```bash
+curl -fsSL https://unsloth.ai/install.sh | sh
+```
+#### Windows:
+```powershell
+irm https://unsloth.ai/install.ps1 | iex
+```
+
+#### Launch
+```bash
+unsloth studio -H 0.0.0.0 -p 8888
+```
+
+#### Update
+To update, use the same install commands as above. Or run (does not work on Windows):
+```bash
+unsloth studio update
+```
+
+#### Docker
+Use our [Docker image](https://hub.docker.com/r/unsloth/unsloth), `unsloth/unsloth`. Run:
+```bash
+docker run -d -e JUPYTER_PASSWORD="mypassword" \
+  -p 8888:8888 -p 8000:8000 -p 2222:22 \
+  -v $(pwd)/work:/workspace/work \
+  --gpus all \
+  unsloth/unsloth
+```
+
+#### Developer, Nightly, Uninstall
+To see developer, nightly and uninstallation etc. instructions, see [advanced installation](#-advanced-installation).
+
+### Unsloth Core (code-based)
+#### Linux, WSL:
+```bash
+curl -LsSf https://astral.sh/uv/install.sh | sh
+uv venv unsloth_env --python 3.13
+source unsloth_env/bin/activate
+uv pip install unsloth --torch-backend=auto
+```
+#### Windows:
+```powershell
+winget install -e --id Python.Python.3.13
+winget install --id=astral-sh.uv  -e
+uv venv unsloth_env --python 3.13
+.\unsloth_env\Scripts\activate
+uv pip install unsloth --torch-backend=auto
+```
+For Windows, `pip install unsloth` works only if you have PyTorch installed. Read our [Windows Guide](https://unsloth.ai/docs/get-started/install/windows-installation).
+You can use the same Docker image as Unsloth Studio.
+
+#### AMD, Intel:
+For RTX 50x, B200, 6000 GPUs: `uv pip install unsloth --torch-backend=auto`. Read our guides for: [Blackwell](https://unsloth.ai/docs/blog/fine-tuning-llms-with-blackwell-rtx-50-series-and-unsloth) and [DGX Spark](https://unsloth.ai/docs/blog/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth). <br>
+To install Unsloth on **AMD** and **Intel** GPUs, follow our [AMD Guide](https://unsloth.ai/docs/get-started/install/amd) and [Intel Guide](https://unsloth.ai/docs/get-started/install/intel).
+
+## 📒 Free Notebooks
+
+Train for free with our notebooks. You can use our new [free Unsloth Studio notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb) to run and train models for free in a web UI.
+Read our [guide](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide). Add dataset, run, then deploy your trained model.
+
+| Model | Free Notebooks | Performance | Memory use |
+|-----------|---------|--------|----------|
+| **Gemma 4 (E2B)**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)-Vision.ipynb)               | 1.5x faster | 50% less |
+| **Qwen3.5 (4B)**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_5_(4B)_Vision.ipynb)               | 1.5x faster | 60% less |
+| **gpt-oss (20B)**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb)               | 2x faster | 70% less |
+| **Qwen3.5 GSPO**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_5_(4B)_Vision_GRPO.ipynb)               | 2x faster | 70% less |
+| **gpt-oss (20B): GRPO**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb)               | 2x faster | 80% less |
+| **Qwen3: Advanced GRPO**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb)               | 2x faster | 70% less |
+| **embeddinggemma (300M)**    | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/EmbeddingGemma_(300M).ipynb)               | 2x faster | 20% less |
+| **Mistral Ministral 3 (3B)**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Ministral_3_VL_(3B)_Vision.ipynb)               | 1.5x faster | 60% less |
+| **Llama 3.1 (8B) Alpaca**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb)               | 2x faster | 70% less |
+| **Llama 3.2 Conversational**      | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)               | 2x faster | 70% less |
+| **Orpheus-TTS (3B)**     | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Orpheus_(3B)-TTS.ipynb)               | 1.5x faster | 50% less |
+
+- See all our notebooks for: [Kaggle](https://github.com/unslothai/notebooks?tab=readme-ov-file#-kaggle-notebooks), [GRPO](https://unsloth.ai/docs/get-started/unsloth-notebooks#grpo-reasoning-rl-notebooks), [TTS](https://unsloth.ai/docs/get-started/unsloth-notebooks#text-to-speech-tts-notebooks), [embedding](https://unsloth.ai/docs/new/embedding-finetuning) & [Vision](https://unsloth.ai/docs/get-started/unsloth-notebooks#vision-multimodal-notebooks)
+- See [all our models](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [all our notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks)
+- See detailed documentation for Unsloth [here](https://unsloth.ai/docs)
+
+## 🦥 Unsloth News
+- **Gemma 4**: Run and train Google’s new models directly in Unsloth Studio! [Blog](https://unsloth.ai/docs/models/gemma-4)
+- **Introducing Unsloth Studio**: our new web UI for running and training LLMs. [Blog](https://unsloth.ai/docs/new/studio)
+- **Qwen3.5** - 0.8B, 2B, 4B, 9B, 27B, 35-A3B, 112B-A10B are now supported. [Guide + notebooks](https://unsloth.ai/docs/models/qwen3.5/fine-tune)
+- Train **MoE LLMs 12x faster** with 35% less VRAM - DeepSeek, GLM, Qwen and gpt-oss. [Blog](https://unsloth.ai/docs/new/faster-moe)
+- **Embedding models**: Unsloth now supports ~1.8-3.3x faster embedding fine-tuning. [Blog](https://unsloth.ai/docs/new/embedding-finetuning) • [Notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks#embedding-models)
+- New **7x longer context RL** vs. all other setups, via our new batching algorithms. [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+- New RoPE & MLP **Triton Kernels** & **Padding Free + Packing**: 3x faster training & 30% less VRAM. [Blog](https://unsloth.ai/docs/new/3x-faster-training-packing)
+- **500K Context**: Training a 20B model with >500K context is now possible on an 80GB GPU. [Blog](https://unsloth.ai/docs/blog/500k-context-length-fine-tuning)
+- **FP8 & Vision RL**: You can now do FP8 & VLM GRPO on consumer GPUs. [FP8 Blog](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/vision-reinforcement-learning-vlm-rl)
+- **gpt-oss** by OpenAI: Read our [RL blog](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/gpt-oss-reinforcement-learning), [Flex Attention](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/long-context-gpt-oss-training) blog and [Guide](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune).
+
+## 📥 Advanced Installation
+The below advanced instructions are for Unsloth Studio. For Unsloth Core advanced installation, [view our docs](https://unsloth.ai/docs/get-started/install/pip-install#advanced-pip-installation).
+#### Developer installs: macOS, Linux, WSL:
+```bash
+git clone https://github.com/unslothai/unsloth
+cd unsloth
+./install.sh --local
+unsloth studio -H 0.0.0.0 -p 8888
+```
+Then to update:
+```bash
+unsloth studio update
+```
+
+#### Developer installs: Windows PowerShell:
+```powershell
+git clone https://github.com/unslothai/unsloth.git
+cd unsloth
+Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
+.\install.ps1 --local
+unsloth studio -H 0.0.0.0 -p 8888
+```
+Then to update:
+```powershell
+unsloth studio update
+```
+
+#### Nightly: macOS, Linux, WSL:
+```bash
+git clone https://github.com/unslothai/unsloth
+cd unsloth
+git checkout nightly
+./install.sh --local
+unsloth studio -H 0.0.0.0 -p 8888
+```
+Then to launch every time:
+```bash
+unsloth studio -H 0.0.0.0 -p 8888
+```
+
+#### Nightly: Windows:
+Run in Windows PowerShell:
+```powershell
+git clone https://github.com/unslothai/unsloth.git
+cd unsloth
+git checkout nightly
+Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
+.\install.ps1 --local
+unsloth studio -H 0.0.0.0 -p 8888
+```
+Then to launch every time:
+```powershell
+unsloth studio -H 0.0.0.0 -p 8888
+```
+
+#### Uninstall
+You can uninstall Unsloth Studio by deleting its install folder, usually located under `$HOME/.unsloth/studio` on Mac/Linux/WSL and `%USERPROFILE%\.unsloth\studio` on Windows. The commands below **delete everything**, including your history and cache:
+
+*  **macOS, WSL, Linux:** `rm -rf ~/.unsloth/studio`
+*  **Windows (PowerShell):** `Remove-Item -Recurse -Force "$HOME\.unsloth\studio"`
+
+For more info, [see our docs](https://unsloth.ai/docs/new/studio/install#uninstall).
+
+#### Deleting model files
+
+You can delete old model files either from the bin icon in model search or by removing the relevant cached model folder from the default Hugging Face cache directory. By default, HF uses:
+
+*  **macOS, Linux, WSL:** `~/.cache/huggingface/hub/`
+*  **Windows:** `%USERPROFILE%\.cache\huggingface\hub\`
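Before deleting anything by hand, it can help to see which cached models take the most space. The following is a minimal sketch, assuming the cache keeps one `models--<org>--<name>` folder per repository directly under the hub directory. The helper name `cached_model_sizes` is hypothetical and not part of Unsloth or `huggingface_hub`. Symlinks are skipped on the assumption that snapshot entries link to the real files stored elsewhere in the same folder.

```python
from pathlib import Path


def cached_model_sizes(hub_dir: Path) -> dict[str, int]:
    """Map each cached model folder under hub_dir to its size in bytes."""
    sizes: dict[str, int] = {}
    if not hub_dir.is_dir():
        return sizes  # nothing cached yet, or a custom cache location is in use
    for repo_dir in sorted(hub_dir.iterdir()):
        # Assumed naming convention: models--<org>--<name>
        if repo_dir.is_dir() and repo_dir.name.startswith("models--"):
            sizes[repo_dir.name] = sum(
                f.stat().st_size
                for f in repo_dir.rglob("*")
                # Skip symlinks so linked snapshot files are not counted twice.
                if f.is_file() and not f.is_symlink()
            )
    return sizes
```

Calling `cached_model_sizes(Path.home() / ".cache" / "huggingface" / "hub")` returns a dictionary that can be sorted by value to find the largest folders. If the cache was relocated through environment variables, pass that directory instead.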
+
+## 💚 Community and Links
+| Type                                                                                                                                      | Links                                                                          |
+| ----------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
+| <img width="16" src="https://cdn.prod.website-files.com/6257adef93867e50d84d30e2/66e3d80db9971f10a9757c99_Symbol.svg" />  **Discord**                       | [Join Discord server](https://discord.com/invite/unsloth)                          |
+| <img width="15" src="https://redditinc.com/hs-fs/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" />  **r/unsloth Reddit**                       | [Join Reddit community](https://reddit.com/r/unsloth)                          |
+| 📚 **Documentation & Wiki**                                                                                                               | [Read Our Docs](https://unsloth.ai/docs)                                       |
+| <img width="13" src="https://upload.wikimedia.org/wikipedia/commons/0/09/X_(formerly_Twitter)_logo_late_2025.svg" />  **Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai)                                |
+| 🔮 **Our Models**                                                                                                                         | [Unsloth Catalog](https://unsloth.ai/docs/get-started/unsloth-model-catalog)   |
+| ✍️ **Blog**                                                                                                                               | [Read our Blogs](https://unsloth.ai/blog)                                      |
+
+### Citation
+
+You can cite the Unsloth repo as follows:
+```bibtex
+@software{unsloth,
+  author = {Daniel Han and Michael Han and {Unsloth team}},
+  title = {Unsloth},
+  url = {https://github.com/unslothai/unsloth},
+  year = {2023}
+}
+```
+If you trained a model with 🦥Unsloth, you can use this cool sticker!   <img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/made with unsloth.png" width="200" align="center" />
+
+### License
+Unsloth uses a dual-licensing model of Apache 2.0 and AGPL-3.0. The core Unsloth package remains licensed under **[Apache 2.0](https://github.com/unslothai/unsloth?tab=Apache-2.0-1-ov-file)**, while certain optional components, such as the Unsloth Studio UI are licensed under the open-source license **[AGPL-3.0](https://github.com/unslothai/unsloth?tab=AGPL-3.0-2-ov-file)**.
+
+This structure helps support ongoing Unsloth development while keeping the project open source and enabling the broader ecosystem to continue growing.
+
+### Thank You to
+- The [llama.cpp library](https://github.com/ggml-org/llama.cpp) that lets users run and save models with Unsloth
+- The Hugging Face team and their libraries: [transformers](https://github.com/huggingface/transformers) and [TRL](https://github.com/huggingface/trl)
+- The PyTorch and [Torch AO](https://github.com/unslothai/unsloth/pull/3391) team for their contributions
+- NVIDIA for their [NeMo DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) library and their contributions
+- And of course every single person who has contributed to or used Unsloth!
@@ -0,0 +1,512 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-26B-A4B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 8192, # Choose any for long context!
+    load_in_4bit = True,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
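To see why `load_in_4bit = True` matters for a model of this size, here is a rough back-of-the-envelope sketch. It assumes the "26B" in the model name means about 26 billion stored parameters, and it counts weights only. Quantization metadata, activations, LoRA adapters and the KV cache are ignored, so real usage will be higher. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


ASSUMED_PARAMS = 26e9  # assumption read off the model name, not a measured value

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights need a quarter of the 16-bit footprint, which is what makes the notebook's single-GPU setup plausible.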
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **tokenizer.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        use_cache = True,
+        temperature = 1.0, top_p = 0.95, top_k = 64,
+        streamer = TextStreamer(tokenizer, skip_prompt = True),
+    )
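The three sampling settings passed to `generate` above can be illustrated with a small pure-Python sketch. It is a simplified illustration of temperature scaling followed by top-k and top-p (nucleus) filtering, not the code that `transformers` runs. The function name `sampling_distribution` is hypothetical.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits before the softmax (1.0 leaves them unchanged).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for stability

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(weights[i] for i in kept)
    return {i: weights[i] / norm for i in kept}
```

With `temperature = 1.0` the logits are used as they are, so the two filters do all the work: `top_k = 64` caps the candidate set and `top_p = 0.95` trims the unlikely tail within it.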
+
+
+# # Gemma 4 can see images!
+# 
+# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
+
+# In[5]:
+
+
+sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "Which films does this animal feature in?" }
+    ]
+}]
+# You might have to wait 1 minute for Unsloth's auto compiler
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# Let's make a poem about sloths!
+
+# In[6]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{ "type" : "text",
+                  "text" : "Write a poem about sloths." }]
+}]
+do_gemma_4_inference(messages)
+
+
+# # Let's finetune Gemma 4!
+# 
+# You can finetune the vision and text parts for now through selection - the audio part can also be finetuned - we're working to make it selectable as well!
+
+# We now add LoRA adapters so we only need to update a small number of parameters!
+
+# In[7]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # Turn off for just text!
+    finetune_language_layers   = True,  # Should leave on!
+    finetune_attention_modules = True,  # Attention good for GRPO
+    finetune_mlp_modules       = True,  # Should leave on always!
+
+    r = 8,           # Larger = higher accuracy, but might overfit
+    lora_alpha = 8,  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+)
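To make "a small number of parameters" concrete, here is a sketch of the standard LoRA arithmetic: an adapter on a `d_in x d_out` linear layer adds two low-rank matrices with `r * (d_in + d_out)` trainable values, and the update is scaled by `lora_alpha / r`. The 4096 x 4096 layer size is a hypothetical example, not a Gemma 4 dimension.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable values added by one LoRA adapter (A is d_in x r, B is r x d_out)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update before it is added to the layer output."""
    return lora_alpha / r


d_in = d_out = 4096  # hypothetical layer size, for illustration only
full = d_in * d_out
added = lora_param_count(d_in, d_out, r=8)
print(f"full layer: {full:,} weights, LoRA r=8: {added:,} ({added / full:.2%})")
print(f"scaling with lora_alpha=8, r=8: {lora_scaling(8, 8)}")
```

With `r = 8` and `lora_alpha = 8`, as in the cell above, the scaling factor is 1.0. Raising `r` without raising `lora_alpha` shrinks the effective update, which is one reading of the "alpha == r at least" comment.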
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
+# 
+# ```
+# <bos><|turn>user
+# Hello<turn|>
+# <|turn>model
+# Hey there!<turn|>
+# ```
+# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
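The turn layout shown above can be reproduced with a few lines of plain Python. This is only a sketch that mirrors the rendered example. The real template is a Jinja file that also handles system prompts, thinking and multimodal content. The helper `render_turns`, the `assistant` to `model` role mapping, and the generation-prompt suffix are assumptions made for illustration.

```python
# Assumed mapping from common chat role names to the role names in the rendered text.
ROLE_NAMES = {"user": "user", "assistant": "model", "model": "model"}


def render_turns(messages, add_generation_prompt=False):
    """Render text-only messages in the <|turn>role ... <turn|> layout shown above."""
    parts = [
        f"<|turn>{ROLE_NAMES[m['role']]}\n{m['content']}<turn|>" for m in messages
    ]
    text = "<bos>" + "\n".join(parts)
    if add_generation_prompt:
        # Assumed: an open model turn invites the model to write the next reply.
        text += "\n<|turn>model\n"
    return text


print(render_turns([
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]))
```

For actual training and inference, rely on `tokenizer.apply_chat_template`, which applies the template shipped with the model.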
+
+# In[8]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4-thinking",
+)
+
+
+# We get the first 3000 rows of the dataset
+
+# In[9]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
+
+
+# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
+
+# In[10]:
+
+
+from unsloth.chat_templates import standardize_data_formats
+dataset = standardize_data_formats(dataset)
+
+
+# Let's see what row 100 looks like!
+
+# In[11]:
+
+
+dataset[100]
+
+
+# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The Processor will add this token before training and the model expects only one.
+
+# In[12]:
+
+
+def formatting_prompts_func(examples):
+   convos = examples["conversations"]
+   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
+   return { "text" : texts, }
+
+dataset = dataset.map(formatting_prompts_func, batched = True)
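The reason for `removeprefix('<bos>')` can be shown with a toy example. The function `processor_encode` below is a hypothetical stand-in for a processor that always prepends one `<bos>`. It is not an Unsloth or `transformers` API.

```python
BOS = "<bos>"


def processor_encode(text: str) -> str:
    """Hypothetical stand-in for a processor that always prepends one BOS token."""
    return BOS + text


rendered = "<bos><|turn>user\nHello<turn|>\n<|turn>model\nHey there!<turn|>"

doubled = processor_encode(rendered)                  # template BOS + processor BOS
single = processor_encode(rendered.removeprefix(BOS))  # only the processor's BOS

print(doubled.count(BOS), single.count(BOS))  # prints: 2 1
```

Stripping the prefix at formatting time keeps exactly one `<bos>` per sample once the processor has added its own.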
+
+
+# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
+
+# In[13]:
+
+
+dataset[100]["text"]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`.
+
+# In[14]:
+
+
+from trl import SFTTrainer, SFTConfig
+trainer = SFTTrainer(
+    model = model,
+    tokenizer = tokenizer,
+    train_dataset = dataset,
+    eval_dataset = None, # Can set up evaluation!
+    args = SFTConfig(
+        dataset_text_field = "text",
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
+        warmup_steps = 5,
+        # num_train_epochs = 1, # Set this for 1 full training run.
+        max_steps = 60,
+        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
+        logging_steps = 1,
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "linear",
+        seed = 3407,
+        report_to = "none", # Use TrackIO/WandB etc
+    ),
+)
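The batch settings above determine how much of the dataset the 60-step run actually sees. Here is a small arithmetic sketch, assuming a single GPU and the 3000-row subset loaded earlier. The helper names are hypothetical.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def steps_per_epoch(num_rows: int, batch_size: int) -> int:
    """Optimizer steps needed to see every row once."""
    return math.ceil(num_rows / batch_size)


batch = effective_batch_size(per_device=1, grad_accum=4)  # values from SFTConfig above
seen = 60 * batch                                          # max_steps = 60
print(f"effective batch size: {batch}")
print(f"rows seen in 60 steps: {seen} of 3000 ({seen / 3000:.0%})")
print(f"steps for one full epoch: {steps_per_epoch(3000, batch)}")
```

Under these assumptions the demo run covers well under one epoch, which is why the notebook suggests switching to `num_train_epochs` for a real fine-tune.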
+
+
+# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
+
+# In[15]:
+
+
+from unsloth.chat_templates import train_on_responses_only
+trainer = train_on_responses_only(
+    trainer,
+    instruction_part = "<|turn>user\n",
+    response_part = "<|turn>model\n",
+)
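The idea behind response-only training can be sketched on plain lists of token ids: every position outside a model turn gets the label `-100`, which PyTorch's cross-entropy loss ignores by default. This is an illustration of the concept under simplified assumptions, not Unsloth's implementation. The function `mask_non_responses` and the toy ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def _matches(seq, start, pattern):
    """True if pattern occurs in seq starting at position start."""
    return seq[start:start + len(pattern)] == pattern


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Copy input_ids into labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    in_response = False
    i = 0
    while i < len(input_ids):
        if _matches(input_ids, i, response_ids):
            in_response = True           # the marker itself stays masked
            i += len(response_ids)
        elif _matches(input_ids, i, instruction_ids):
            in_response = False          # a new user turn ends the response
            i += len(instruction_ids)
        else:
            if in_response:
                labels[i] = input_ids[i]
            i += 1
    return labels


# Toy ids: [1, 2] stands for "<|turn>user\n" and [1, 3] for "<|turn>model\n".
ids = [0, 1, 2, 10, 11, 9, 1, 3, 20, 21, 9]
print(mask_non_responses(ids, instruction_ids=[1, 2], response_ids=[1, 3]))
```

Only the three tokens after the response marker keep their ids, so the loss is computed on the model's reply alone.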
+
+
+# Let's verify that the instruction part is masked! Let's print the 100th row again.  Notice how the sample only has a single `<bos>` as expected!
+
+# In[16]:
+
+
+tokenizer.decode(trainer.train_dataset[100]["input_ids"])
+
+
+# Now let's print the masked out example - you should see only the answer is present:
+
+# In[17]:
+
+
+tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
+
+
+# In[18]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
+
+# In[19]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[20]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the Gemma team, the recommended settings for Gemma 4 inference are `temperature = 1.0, top_p = 0.95, top_k = 64`.
+
+# In[21]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4-thinking",
+)
+messages = [{
+    "role": "user",
+    "content": [{
+        "type" : "text",
+        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
+    }]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+outputs = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    use_cache = True,
+    # Recommended Gemma 4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+)
+tokenizer.batch_decode(outputs)
+
+
+# You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!
+
+# In[22]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    use_cache = True,
+    # Recommended Gemma 4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[23]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:
+
+# In[24]:
+
+
+if False:
+    from unsloth import FastModel
+    model, tokenizer = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    # Recommended Gemma 4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[25]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[26]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
+
+# In[27]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[28]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the generated `.gguf` file (saved under `gemma_4_finetune` with the quantization you chose, for example `Q8_0`) in llama.cpp.
+# 
+# And we're done! If you have questions about Unsloth, find a bug, need help, want to join projects, or want to keep up with the latest LLM news, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+
+# In[3]:
+
+
+from unsloth import FastVisionModel # FastLanguageModel for LLMs
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastVisionModel.from_pretrained(
+    "unsloth/gemma-4-26B-A4B-it",
+    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
+    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
+)
+
+
+# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
+# 
+# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
+
+# In[4]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    finetune_vision_layers     = True, # False if not finetuning vision layers
+    finetune_language_layers   = True, # False if not finetuning language layers
+    finetune_attention_modules = True, # False if not finetuning attention layers
+    finetune_mlp_modules       = True, # False if not finetuning MLP layers
+
+    r = 32,                           # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 32,                  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,               # We support rank stabilized LoRA
+    loftq_config = None,               # And LoftQ
+    target_modules = "all-linear",    # Optional now! Can specify a list if needed
+)
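+
+
+# To see why LoRA trains only a small share of the weights, here is a rough sketch of the arithmetic for a single linear layer. The layer size is hypothetical and is not taken from Gemma 4; only `r = 32` and `lora_alpha = 32` come from the cell above. Standard (non rank-stabilized) LoRA scales the update by `lora_alpha / r`.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer (A is r x d_in, B is d_out x r)."""
    return r * (d_in + d_out)


d_in, d_out = 4096, 4096  # hypothetical layer size, for illustration only
r, lora_alpha = 32, 32    # values used in the cell above

base = d_in * d_out
added = lora_param_count(d_in, d_out, r)
fraction = added / base
scaling = lora_alpha / r  # standard LoRA scaling (use_rslora = False)

print(f"base={base:,} added={added:,} fraction={fraction:.2%} scaling={scaling}")
```

+# For this illustrative layer the adapter adds about 1.6% extra parameters; doubling `r` doubles that share.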
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
+# 
+# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
+
+# In[5]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
+
+
+# Let's take an overview of the dataset. We'll examine the image at index 2 and its corresponding caption.
+
+# In[6]:
+
+
+dataset
+
+
+# In[7]:
+
+
+dataset[2]["image"]
+
+
+# In[8]:
+
+
+dataset[2]["text"]
+
+
+# We can also render LaTeX directly in the browser!
+
+# In[9]:
+
+
+from IPython.display import display, Math, Latex
+
+latex = dataset[3]["text"]
+display(Math(latex))
+
+
+# To format the dataset, all vision fine-tuning tasks should follow this format:
+# 
+# ```python
+# [
+#     {
+#         "role": "user",
+#         "content": [
+#             {"type": "text", "text": instruction},
+#             {"type": "image", "image": sample["image"]},
+#         ],
+#     },
+#     {
+#         "role": "assistant",
+#         "content": [
+#             {"type": "text", "text": sample["text"]},
+#         ],
+#     },
+# ]
+# ```
+
+# In[10]:
+
+
+instruction = "Write the LaTeX representation for this image."
+
+def convert_to_conversation(sample):
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": instruction},
+                {"type": "image", "image": sample["image"]},
+            ],
+        },
+        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
+    ]
+    return {"messages": conversation}
+pass
+
+
+# Let's convert the dataset into the "correct" format for finetuning:
+
+# In[11]:
+
+
+converted_dataset = [convert_to_conversation(sample) for sample in dataset]
+
+
+# The first example is now structured like below:
+
+# In[12]:
+
+
+converted_dataset[0]
+
+
+# Let's take the Gemma 4 instruction chat template and apply it to our model's processor
+
+# In[13]:
+
+
+from unsloth import get_chat_template
+
+processor = get_chat_template(
+    processor,
+    "gemma-4-thinking"
+)
+
+
+# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
+
+# In[14]:
+
+
+image = dataset[2]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# You can see it's absolutely terrible! It doesn't follow instructions at all
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We run 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and set `max_steps=None` to turn off the step limit. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
+# 
+# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
+
+# In[15]:
+
+
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = converted_dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4,
+        max_grad_norm = 0.3,
+        warmup_ratio = 0.03,
+        max_steps = 60,
+        # num_train_epochs = 2, # Set this instead of max_steps for full training runs
+        learning_rate = 2e-4,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none", # For Weights and Biases or others
+
+        # You MUST put the below items for vision finetuning:
+        remove_unused_columns = False,
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 2048,
+    )
+)
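+
+
+# A quick sanity check on what the settings above imply, written as a sketch. It assumes a single GPU (`num_devices = 1` is an assumption, not something the config states) and assumes the warmup step count is the ratio times the total steps, rounded up.

```python
import math

# Values copied from the SFTConfig above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 60
warmup_ratio = 0.03

num_devices = 1  # assumption: single-GPU run

# Samples contributing to each optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total samples seen over the short 60-step run.
samples_seen = effective_batch * max_steps
# Warmup length, assuming the ratio is rounded up to whole steps.
warmup_steps = math.ceil(max_steps * warmup_ratio)

print(effective_batch, samples_seen, warmup_steps)
```

+# So the 60-step demo run only touches a few hundred samples; a full epoch over the dataset takes correspondingly longer.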
+
+
+# In[16]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# In[17]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[18]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model! You can modify the instruction and input, but leave the output blank.
+# 
+# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
+
+# In[19]:
+
+
+image = dataset[10]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[20]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:
+
+# In[21]:
+
+
+if False:
+    from unsloth import FastVisionModel
+
+    model, processor = FastVisionModel.from_pretrained(
+        model_name = "gemma_4_lora",  # YOUR MODEL YOU USED FOR TRAINING
+        load_in_4bit = True,  # Set to False for 16bit LoRA
+    )
+
+sample = dataset[1]
+image = sample["image"].convert("RGB")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": instruction,  # Prompt with the instruction, not the ground-truth LaTeX
+            },
+            {
+                "type": "image",
+            },
+        ],
+    },
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
+_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                   use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[22]:
+
+
+# Select ONLY 1 to save! (Both not needed!)
+
+# Save locally to 16bit
+if False: model.save_pretrained_merged("unsloth_finetune", processor,)
+
+# To export and save to your Hugging Face account
+if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
+
+
+# And we're done! If you have questions about Unsloth, find a bug, need help, want to join projects, or want to keep up with the latest LLM news, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,513 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-31B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 8192, # Choose any for long context!
+    load_in_4bit = True,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **tokenizer.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        use_cache = True,
+        temperature = 1.0, top_p = 0.95, top_k = 64,
+        streamer = TextStreamer(tokenizer, skip_prompt = True),
+    )
+
+
+# # Gemma 4 can see images!
+# 
+# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
+
+# In[5]:
+
+
+sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "Which films does this animal feature in?" }
+    ]
+}]
+# You might have to wait 1 minute for Unsloth's auto compiler
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# Let's make a poem about sloths!
+
+# In[6]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{ "type" : "text",
+                  "text" : "Write a poem about sloths." }]
+}]
+do_gemma_4_inference(messages)
+
+
+# # Let's finetune Gemma 4!
+# 
+# For now, you can choose whether to finetune the vision and text parts. The audio part can also be finetuned, and we're working on making it selectable as well!
+
+# We now add LoRA adapters so we only need to update a small number of parameters!
+
+# In[7]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # Turn off for just text!
+    finetune_language_layers   = True,  # Should leave on!
+    finetune_attention_modules = True,  # Attention good for GRPO
+    finetune_mlp_modules       = True,  # Should leave on always!
+
+    r = 8,           # Larger = higher accuracy, but might overfit
+    lora_alpha = 8,  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+)
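To give a feel for what `r` and `lora_alpha` control, here is a minimal sketch of the standard LoRA arithmetic: an adapter on a `d_out x d_in` weight adds two low-rank matrices with `r * (d_in + d_out)` parameters in total, and the low-rank update is scaled by `lora_alpha / r`. The layer dimensions below are hypothetical and are not read from any Gemma 4 config.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / r


# Hypothetical 4096 x 4096 projection, chosen only for illustration.
d_in, d_out = 4096, 4096
full = d_in * d_out
added = lora_param_count(d_in, d_out, r=8)
print(f"full weight: {full:,} params, LoRA adds: {added:,} ({added / full:.2%})")
print("scaling with r = 8, lora_alpha = 8:", lora_scaling(8, 8))
```

With `lora_alpha == r` the scaling factor is 1, which is one way to read the "alpha == r at least" recommendation in the cell above.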
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We now use the `Gemma-4` format for conversation-style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi-turn conversations like below:
+# 
+# ```
+# <bos><|turn>user
+# Hello<turn|>
+# <|turn>model
+# Hey there!<turn|>
+# ```
+# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
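The turn layout shown above can be sketched in plain Python. This is only an illustration reconstructed from the rendered example, not the actual Jinja chat template: `render_gemma4_turns` and `ROLE_NAMES` are hypothetical names, the `assistant` to `model` mapping is an assumption, and the real template also handles images, system prompts and thinking blocks.

```python
# Assumed role mapping: chat datasets usually say "assistant", the rendered text says "model".
ROLE_NAMES = {"user": "user", "assistant": "model", "model": "model"}


def render_gemma4_turns(messages, add_bos=True):
    """Render text-only messages in the turn layout shown above (illustrative only)."""
    parts = ["<bos>"] if add_bos else []
    for message in messages:
        role = ROLE_NAMES[message["role"]]
        parts.append(f"<|turn>{role}\n{message['content']}<turn|>\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma4_turns(example))
```

For real training and inference, rely on `tokenizer.apply_chat_template`, which uses the template shipped with the model.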
+
+# In[8]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4-thinking",
+)
+
+
+# We get the first 3000 rows of the dataset
+
+# In[9]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
+
+
+# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
+
+# In[10]:
+
+
+from unsloth.chat_templates import standardize_data_formats
+dataset = standardize_data_formats(dataset)
+
+
+# Let's see what row 100 looks like!
+
+# In[11]:
+
+
+dataset[100]
+
+
+# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The Processor will add this token before training and the model expects only one.
+
+# In[12]:
+
+
+def formatting_prompts_func(examples):
+   convos = examples["conversations"]
+   texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
+   return { "text" : texts, }
+
+dataset = dataset.map(formatting_prompts_func, batched = True)
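Why the prefix is stripped can be shown with a tiny stand-alone sketch. It assumes, as the text above says, that a later step prepends one `<bos>` on its own; the prepending is simulated here with plain string concatenation, and `strip_leading_bos` is a hypothetical helper name.

```python
BOS = "<bos>"


def strip_leading_bos(text: str) -> str:
    """Drop one leading BOS marker, if present, so it is not added twice later."""
    return text.removeprefix(BOS)  # str.removeprefix needs Python 3.9+


rendered = "<bos><|turn>user\nHello<turn|>\n"
stripped = strip_leading_bos(rendered)

# Simulate a later step that always prepends BOS during tokenization.
print(repr(BOS + rendered))  # two BOS markers at the start
print(repr(BOS + stripped))  # exactly one BOS marker
```

Unlike `str.lstrip`, `removeprefix` removes the exact prefix once and leaves text without the prefix untouched.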
+
+
+# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
+
+# In[13]:
+
+
+dataset[100]["text"]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off `max_steps` by setting `max_steps=None`.
+
+# In[14]:
+
+
+from trl import SFTTrainer, SFTConfig
+trainer = SFTTrainer(
+    model = model,
+    tokenizer = tokenizer,
+    train_dataset = dataset,
+    eval_dataset = None, # Can set up evaluation!
+    args = SFTConfig(
+        dataset_text_field = "text",
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
+        warmup_steps = 5,
+        # num_train_epochs = 1, # Set this for 1 full training run.
+        max_steps = 60,
+        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
+        logging_steps = 1,
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "linear",
+        seed = 3407,
+        report_to = "none", # Use TrackIO/WandB etc
+    ),
+)
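As a rough sanity check on the settings above, the sketch below works out the effective batch size from gradient accumulation and how much of the 3000-row subset a 60-step run touches. It assumes a single GPU and one example per sequence (no packing); the function names are illustrative.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def samples_seen(max_steps: int, per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Total examples consumed after max_steps optimizer steps."""
    return max_steps * effective_batch_size(per_device_batch, grad_accum_steps, num_devices)


batch = effective_batch_size(per_device_batch=1, grad_accum_steps=4)
seen = samples_seen(max_steps=60, per_device_batch=1, grad_accum_steps=4)
dataset_rows = 3000  # size of the train[:3000] split loaded earlier

print(f"effective batch size: {batch}")
print(f"examples seen in 60 steps: {seen} ({seen / dataset_rows:.0%} of the subset)")
```

This is why the 60-step run is only a quick demo: it sees a small fraction of the data, while `num_train_epochs = 1` would pass over all of it once.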
+
+
+# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
+
+# In[15]:
+
+
+from unsloth.chat_templates import train_on_responses_only
+trainer = train_on_responses_only(
+    trainer,
+    instruction_part = "<|turn>user\n",
+    response_part = "<|turn>model\n",
+)
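The idea behind response-only training can be sketched on toy token ids. This is not Unsloth's implementation, just a minimal illustration of the masking: every label outside a response span is set to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The marker id sequences are made up for the example.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def find_all(seq, pattern):
    """Start indices of every occurrence of pattern inside seq."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # A response runs until the next instruction marker, or to the end.
        end = next((i for i in instruction_starts if i >= begin), len(input_ids))
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy ids: [10, 11] stands in for "<|turn>user\n", [20, 21] for "<|turn>model\n".
input_ids = [1, 10, 11, 5, 6, 20, 21, 7, 8, 10, 11, 9, 20, 21, 3]
labels = mask_non_responses(input_ids, [10, 11], [20, 21])
print(labels)
```

Only the tokens after each response marker keep their ids, so the loss is computed on the model's answers alone.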
+
+
+# Let's verify that the instruction part is masked! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
+
+# In[16]:
+
+
+tokenizer.decode(trainer.train_dataset[100]["input_ids"])
+
+
+# Now let's print the masked out example - you should see only the answer is present:
+
+# In[17]:
+
+
+tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
+
+
+# In[18]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
+
+# In[19]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[20]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
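The arithmetic in the two memory cells can be pulled into a small helper that works on raw byte counts, which makes it easy to check without a GPU. The byte values below are invented for illustration; in the notebook they come from `torch.cuda.max_memory_reserved()` and `gpu_stats.total_memory`.

```python
def bytes_to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB, rounded to 3 decimals as in the cells above."""
    return round(num_bytes / 1024 / 1024 / 1024, 3)


def memory_report(start_reserved: int, peak_reserved: int, total: int) -> dict:
    """Mirror the notebook's stats: peak usage, training-only usage, and percentages."""
    start = bytes_to_gib(start_reserved)
    used = bytes_to_gib(peak_reserved)
    max_memory = bytes_to_gib(total)
    used_for_training = round(used - start, 3)
    return {
        "peak_gib": used,
        "training_gib": used_for_training,
        "peak_pct": round(used / max_memory * 100, 3),
        "training_pct": round(used_for_training / max_memory * 100, 3),
    }


GIB = 1024 ** 3
# Hypothetical numbers: 4 GiB reserved before training, 10 GiB peak, 16 GiB card.
report = memory_report(4 * GIB, 10 * GIB, 16 * GIB)
print(report)
```

"Training" memory here is simply the difference between the peak and what was already reserved after loading the model.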
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[21]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4-thinking",
+)
+messages = [{
+    "role": "user",
+    "content": [{
+        "type" : "text",
+        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
+    }]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+outputs = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    use_cache = True,
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+)
+tokenizer.batch_decode(outputs)
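To make the three sampling knobs concrete, here is a simplified pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering over a toy distribution. It is an illustration of the general technique, not the code path `model.generate` uses, and the exact order and tie-breaking in a real library may differ.

```python
import math


def filter_top_k_top_p(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature rescales logits before the softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Top-k: keep the k most likely tokens, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in order)
    k_probs = {i: probs[i] / k_total for i in order}

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += k_probs[i]
        if cumulative >= top_p:
            break
    p_total = sum(k_probs[i] for i in kept)
    return {i: k_probs[i] / p_total for i in kept}


toy_logits = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
print(filter_top_k_top_p(toy_logits, temperature=1.0, top_k=3, top_p=0.8))
```

With `temperature = 1.0` the distribution is left unchanged, so in the recommended settings the filtering is done by `top_k = 64` and `top_p = 0.95`.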
+
+
+# You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
+
+# In[22]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    use_cache = True,
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[23]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
+
+# In[24]:
+
+
+if False:
+    from unsloth import FastModel
+    model, tokenizer = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    use_cache = True,
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[25]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[26]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
+
+# In[27]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
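When choosing between the three export precisions, a back-of-the-envelope size estimate can help. The sketch below assumes the usual llama.cpp layouts: `Q8_0` stores blocks of 32 int8 weights plus one 16-bit scale (8.5 bits per weight), and `F16`/`BF16` use 16 bits per weight. It ignores file metadata and any tensors kept at a different precision, and the parameter count is hypothetical.

```python
# Approximate bits per weight for each export precision (assumption stated above).
BITS_PER_WEIGHT = {"Q8_0": 8.5, "F16": 16.0, "BF16": 16.0}


def estimate_gguf_bytes(num_params: int, quantization_method: str) -> float:
    """Rough weight-only size estimate in bytes for a GGUF export."""
    return num_params * BITS_PER_WEIGHT[quantization_method] / 8


hypothetical_params = 4_000_000_000  # illustrative only, not an official model size
for method in ("Q8_0", "F16", "BF16"):
    gib = estimate_gguf_bytes(hypothetical_params, method) / 1024 ** 3
    print(f"{method}: about {gib:.2f} GiB")
```

The real file will differ somewhat, but the ratio shows that `Q8_0` is a little over half the size of a 16-bit export.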
+
+
+# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[28]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported `.gguf` file (`Q8_0`, `F16` or `BF16`, depending on the `quantization_method` you chose) in llama.cpp.
+# 
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab A100 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+
+# In[3]:
+
+
+from unsloth import FastVisionModel # FastLanguageModel for LLMs
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastVisionModel.from_pretrained(
+    "unsloth/gemma-4-31B-it",
+    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
+    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
+)
+
+
+# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
+# 
+# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
+
+# In[4]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    finetune_vision_layers     = True, # False if not finetuning vision layers
+    finetune_language_layers   = True, # False if not finetuning language layers
+    finetune_attention_modules = True, # False if not finetuning attention layers
+    finetune_mlp_modules       = True, # False if not finetuning MLP layers
+
+    r = 32,                           # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 32,                  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,               # We support rank stabilized LoRA
+    loftq_config = None,               # And LoftQ
+    target_modules = "all-linear",    # Optional now! Can specify a list if needed
+)
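The `use_rslora` flag changes how the low-rank update is scaled. As a minimal sketch of the published formulas (not of Unsloth's or PEFT's internals): standard LoRA scales by `lora_alpha / r`, while rank-stabilized LoRA scales by `lora_alpha / sqrt(r)`. The helper name `lora_scale` is hypothetical.

```python
import math


def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the low-rank update under standard LoRA or rsLoRA."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r


for rank in (8, 32, 128):
    standard = lora_scale(rank, rank)
    stabilized = lora_scale(rank, rank, use_rslora=True)
    print(f"r = alpha = {rank}: standard = {standard:.3f}, rsLoRA = {stabilized:.3f}")
```

With `lora_alpha == r`, the standard factor stays at 1 for every rank, while the rsLoRA factor grows with the square root of the rank, so `lora_alpha` may need retuning if you switch the flag on.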
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
+# 
+# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
+
+# In[5]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
+
+
+# Let's take an overview of the dataset. We'll examine the third image (index 2) and its corresponding caption.
+
+# In[6]:
+
+
+dataset
+
+
+# In[7]:
+
+
+dataset[2]["image"]
+
+
+# In[8]:
+
+
+dataset[2]["text"]
+
+
+# We can also render LaTeX directly in the browser!
+
+# In[9]:
+
+
+from IPython.display import display, Math, Latex
+
+latex = dataset[3]["text"]
+display(Math(latex))
+
+
+# To format the dataset, all vision fine-tuning tasks should follow this format:
+# 
+# ```python
+# [
+#     {
+#         "role": "user",
+#         "content": [
+#             {"type": "text", "text": instruction},
+#             {"type": "image", "image": sample["image"]},
+#         ],
+#     },
+#     {
+#         "role": "assistant",
+#         "content": [
+#             {"type": "text", "text": answer},
+#         ],
+#     },
+# ]
+# ```
+
+# In[10]:
+
+
+instruction = "Write the LaTeX representation for this image."
+
+def convert_to_conversation(sample):
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": instruction},
+                {"type": "image", "image": sample["image"]},
+            ],
+        },
+        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
+    ]
+    return {"messages": conversation}
+pass
+
+
+# Let's convert the dataset into the "correct" format for finetuning:
+
+# In[11]:
+
+
+converted_dataset = [convert_to_conversation(sample) for sample in dataset]
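Before training, it can be worth checking that every converted example really has the expected shape. The sketch below is a hypothetical validator written for this notebook's layout (one user turn with text plus exactly one image, followed by one assistant turn with text only); the rules are assumptions based on `convert_to_conversation`, not a format enforced by Unsloth.

```python
def validate_vision_conversation(example):
    """Return a list of problems found in one {"messages": [...]} example."""
    problems = []
    messages = example.get("messages", [])
    roles = [message.get("role") for message in messages]
    if roles != ["user", "assistant"]:
        problems.append(f"expected roles ['user', 'assistant'], got {roles}")
        return problems

    user_types = [part.get("type") for part in messages[0]["content"]]
    if user_types.count("image") != 1:
        problems.append("user turn should contain exactly one image part")
    if "text" not in user_types:
        problems.append("user turn should contain a text instruction")

    assistant_types = [part.get("type") for part in messages[1]["content"]]
    if assistant_types != ["text"]:
        problems.append("assistant turn should contain a single text part")
    return problems


# A stand-in example; the string replaces the PIL image used in the real dataset.
good = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Write the LaTeX representation for this image."},
            {"type": "image", "image": "<placeholder image>"},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": "x^2 + y^2"}]},
    ]
}
print(validate_vision_conversation(good))
```

Running it over `converted_dataset` and printing any non-empty results is a cheap way to catch malformed rows early.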
+
+
+# The first example is now structured like below:
+
+# In[12]:
+
+
+converted_dataset[0]
+
+
+# Let's take the Gemma 4 instruction chat template and use it with our model
+
+# In[13]:
+
+
+from unsloth import get_chat_template
+
+processor = get_chat_template(
+    processor,
+    "gemma-4-thinking"
+)
+
+
+# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
+
+# In[14]:
+
+
+image = dataset[2]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# You can see it's absolutely terrible! It doesn't follow instructions at all
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off `max_steps` by setting `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
+# 
+# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
+
+# In[15]:
+
+
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = converted_dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4,
+        max_grad_norm = 0.3,
+        warmup_ratio = 0.03,
+        max_steps = 60,
+        # num_train_epochs = 2, # Set this instead of max_steps for full training runs
+        learning_rate = 2e-4,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none", # For Weights and Biases or others
+
+        # You MUST put the below items for vision finetuning:
+        remove_unused_columns = False,
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 2048,
+    )
+)
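The combination of `warmup_ratio = 0.03`, `max_steps = 60` and `lr_scheduler_type = "cosine"` can be visualized with a small stand-alone sketch of a linear-warmup, cosine-decay schedule. It follows the commonly used formula and assumes the warmup length is the ratio times the step count, rounded up; the trainer's exact rounding may differ.

```python
import math


def warmup_cosine_lr(step, base_lr=2e-4, max_steps=60, warmup_ratio=0.03):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1, 2, 15, 31, 45, 60):
    print(f"step {step:2d}: lr = {warmup_cosine_lr(step):.6f}")
```

With only 60 steps the warmup is very short, so almost the whole run is spent on the cosine decay from `learning_rate` down to zero.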
+
+
+# In[16]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# In[17]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[18]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model! You can modify the instruction and input—just leave the output blank.
+# 
+# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
+
+# In[19]:
+
+
+image = dataset[10]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[20]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
+
+# In[21]:
+
+
+if False:
+    from unsloth import FastVisionModel
+
+    model, processor = FastVisionModel.from_pretrained(
+        model_name = "gemma_4_lora",  # YOUR MODEL YOU USED FOR TRAINING
+        load_in_4bit = True,  # Set to False for 16bit LoRA
+    )
+
+sample = dataset[1]
+image = sample["image"].convert("RGB")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": instruction,  # Prompt with the instruction, not the ground-truth LaTeX
+            },
+            {
+                "type": "image",
+            },
+        ],
+    },
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
+_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                   use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[22]:
+
+
+# Select ONLY 1 to save! (Both not needed!)
+
+# Save locally to 16bit
+if False: model.save_pretrained_merged("unsloth_finetune", processor,)
+
+# To export and save to your Hugging Face account
+if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
+
+
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,478 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+from huggingface_hub import snapshot_download
+
+fourbit_models = [
+    # Gemma 4 models
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B-it",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E2B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 8192, # Choose any for long context!
+    load_in_4bit = False,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. Gemma 4's recommended settings are `temperature = 1.0, top_p = 0.95, top_k = 64`, but for ASR in this example we use greedy decoding with `do_sample=False`.
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **processor.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        do_sample = False,
+        streamer = TextStreamer(processor, skip_prompt = True),
+    )
+
+
+# <h3>Let's Evaluate Gemma 4 Baseline Performance on German Transcription</h3>
+
+# In[5]:
+
+
+from datasets import load_dataset, Audio
+
+dataset = load_dataset("kadirnar/Emilia-DE-B000000", split = "train")
+
+# Select a single audio sample to reserve for testing.
+# This index is chosen from the full dataset before we create the smaller training split.
+test_audio = dataset[7546]
+
+dataset = dataset.select(range(3000))
+
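+# Resample the training split to 16 kHz. Note that `test_audio` was drawn before
+# this cast, so it keeps the sampling rate stored in the dataset.
+dataset = dataset.cast_column("audio", Audio(sampling_rate = 16000))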
+
+
+# In[6]:
+
+
+from IPython.display import Audio, display
+print(test_audio['text'])
+Audio(test_audio['audio']['array'],rate = test_audio['audio']['sampling_rate'])
+
+
+# And the translation of the audio from German to English is:
+# 
+# > I—I hold myself directly accountable. That much is, of course, clear: namely, that there are political interests involved in trade—in the exchange of goods—and that political influences are at play. The question is: that should not be the alternative.
+
+# In[7]:
+
+
+messages = [
+    {
+        "role": "system",
+        "content": [
+            {
+                "type": "text",
+                "text": "You are an assistant that transcribes speech accurately.",
+            }
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "audio": test_audio['audio']['array']},
+            {"type": "text", "text": "Please transcribe this audio."}
+        ]
+    }
+]
+
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# <h3>Baseline Model Performance: 32.43% Word Error Rate (WER) for this sample!</h3>
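+
+# A figure like the WER above is conventionally the word-level edit distance between the model's transcript and the reference text, divided by the number of reference words. The cell below is a minimal sketch of that calculation, not the exact script that produced the number: `word_error_rate` is a hypothetical helper, and it applies no text normalization (casing, punctuation), which dedicated packages usually do.
+
+# In[ ]:
+
+
+def word_error_rate(reference: str, hypothesis: str) -> float:
+    """Word-level Levenshtein distance divided by the reference length."""
+    ref, hyp = reference.split(), hypothesis.split()
+    if not ref:
+        raise ValueError("reference must contain at least one word")
+    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
+    prev = list(range(len(hyp) + 1))
+    for i, ref_word in enumerate(ref, start = 1):
+        curr = [i]
+        for j, hyp_word in enumerate(hyp, start = 1):
+            cost = 0 if ref_word == hyp_word else 1
+            curr.append(min(
+                prev[j] + 1,         # deletion of a reference word
+                curr[j - 1] + 1,     # insertion of a hypothesis word
+                prev[j - 1] + cost,  # substitution or exact match
+            ))
+        prev = curr
+    return prev[-1] / len(ref)
+
+# Toy check with made-up strings: one extra word against a three-word reference.
+print(word_error_rate("ich rechne direkt", "ich rechne direkt an"))
+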
+
+# # Let's finetune Gemma 4!
+# 
+# You can finetune the vision, text and audio parts.
+
+# We now add LoRA adapters so we only need to update a small number of parameters!
+
+# In[8]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # False if not finetuning vision layers
+    finetune_language_layers   = True,  # False if not finetuning language layers
+    finetune_attention_modules = True,  # False if not finetuning attention layers
+    finetune_mlp_modules       = True,  # False if not finetuning MLP layers
+
+    r = 8,                              # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 16,                    # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,                 # We support rank stabilized LoRA
+    loftq_config = None,                # And LoftQ
+    target_modules = [
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+
+        # Audio layers
+        "post", "linear_start", "linear_end",
+        "embedding_projection",
+        "ffw_layer_1", "ffw_layer_2",
+        "output_proj",
+    ]
+)
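+
+
+# To see why LoRA trains so few parameters: a rank-`r` adapter on a `d_in x d_out` linear layer adds two small matrices with `r * (d_in + d_out)` weights in total, and the standard formulation scales their product by `lora_alpha / r` (or `lora_alpha / sqrt(r)` with rank-stabilized LoRA). The sketch below only does this arithmetic. The helper names and the 2048 x 2048 layer size are illustrative assumptions, not values read from Gemma 4.
+
+# In[ ]:
+
+
+import math
+
+def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
+    """Trainable weights added by one rank-r adapter (A: r x d_in, B: d_out x r)."""
+    return r * (d_in + d_out)
+
+def lora_scale(r: int, lora_alpha: float, use_rslora: bool = False) -> float:
+    """Multiplier applied to the low-rank update."""
+    return lora_alpha / (math.sqrt(r) if use_rslora else r)
+
+d_in = d_out = 2048  # hypothetical layer size, for illustration only
+full = d_in * d_out
+extra = lora_extra_params(d_in, d_out, r = 8)
+print(f"full layer: {full:,} weights, LoRA adapter: {extra:,} ({extra / full:.2%})")
+print(f"scale with r = 8, lora_alpha = 16: {lora_scale(8, 16)}")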
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We adapt the `kadirnar/Emilia-DE-B000000` dataset for our German ASR task using Gemma 4 multi-modal chat format. Each audio-text pair is structured into a conversation with `system`, `user`, and `assistant` roles. The processor then converts this into the final training format:
+# 
+# ```
+# <bos><|turn>system
+# You are an assistant that transcribes speech accurately.<turn|>
+# <|turn>user
+# <|audio|>Please transcribe this audio.<turn|>
+# <|turn>model
+# Ich, ich rechne direkt mich an.<turn|>
+# ```
+
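+# The turn layout shown above can be mimicked in a few lines of plain Python, which is handy for eyeballing whether a formatted sample looks right. This is an illustrative sketch only: `render_turns` is a hypothetical helper, the newline after each closing `<turn|>` is an assumption, and the processor's own chat template remains the source of truth.
+
+# In[ ]:
+
+
+def render_turns(conversation: list) -> str:
+    """Illustrative only: mimic the turn layout for text and audio parts."""
+    role_names = {"assistant": "model"}  # assistant turns are labelled "model"
+    rendered = "<bos>"
+    for message in conversation:
+        role = role_names.get(message["role"], message["role"])
+        parts = []
+        for part in message["content"]:
+            if part["type"] == "audio":
+                parts.append("<|audio|>")  # placeholder standing in for the audio input
+            elif part["type"] == "text":
+                parts.append(part["text"])
+        rendered += f"<|turn>{role}\n{''.join(parts)}<turn|>\n"
+    return rendered
+
+# Made-up conversation; the audio payload is irrelevant for the layout.
+example_conversation = [
+    {"role": "user", "content": [
+        {"type": "audio", "audio": None},
+        {"type": "text", "text": "Please transcribe this audio."},
+    ]},
+    {"role": "assistant", "content": [{"type": "text", "text": "Ich, ich rechne direkt mich an."}]},
+]
+print(render_turns(example_conversation))
+
+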
+# In[9]:
+
+
+def format_intersection_data(samples: dict) -> dict[str, list]:
+    """Format a batch of audio/text samples into the chat message format the processor expects."""
+    formatted_samples = {"messages": []}
+    for idx in range(len(samples["audio"])):
+        audio = samples["audio"][idx]["array"]
+        label = str(samples["text"][idx])
+
+        message = [
+            {
+                "role": "system",
+                "content": [
+                    {
+                        "type": "text",
+                        "text": "You are an assistant that transcribes speech accurately.",
+                    }
+                ],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "audio", "audio": audio},
+                    {"type": "text", "text": "Please transcribe this audio."}
+                ]
+            },
+            {
+                "role": "assistant",
+                "content":[{"type": "text", "text": label}]
+            }
+        ]
+        formatted_samples["messages"].append(message)
+    return formatted_samples
+
+
+# In[10]:
+
+
+dataset = dataset.map(format_intersection_data, batched = True, batch_size = 4, num_proc = 4)
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`.
+
+# In[11]:
+
+
+# Use UnslothVisionDataCollator which handles audio token alignment correctly
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 8,
+        gradient_accumulation_steps = 1,
+        warmup_ratio = 0.03,
+        # num_train_epochs = 1, # Use for full training runs
+        max_steps = 60,
+        learning_rate = 5e-5,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none",
+        remove_unused_columns = False,
+
+        # The below are a must for audio finetuning:
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 8192,
+    )
+)
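+
+
+# As a rough check on how much data the 60-step run sees: each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` samples per device. The sketch below does that arithmetic for the settings above, assuming a single GPU and the 3,000-sample split. `training_coverage` is a hypothetical helper, not part of TRL or Unsloth.
+
+# In[ ]:
+
+
+import math
+
+def training_coverage(num_samples, batch_size, grad_accum, max_steps, num_devices = 1):
+    """Return how many samples a step-limited run touches and how long one epoch would be."""
+    samples_per_step = batch_size * grad_accum * num_devices
+    samples_seen = samples_per_step * max_steps
+    return {
+        "samples_seen": samples_seen,
+        "fraction_of_epoch": samples_seen / num_samples,
+        "steps_per_epoch": math.ceil(num_samples / samples_per_step),
+    }
+
+# Values copied from the SFTConfig above: batch size 8, no accumulation, 60 steps.
+# That is 480 of 3,000 samples (16% of an epoch); a full epoch would take 375 steps.
+print(training_coverage(num_samples = 3000, batch_size = 8, grad_accum = 1, max_steps = 60))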
+
+
+# In[12]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
+
+# In[13]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[14]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
+
+# In[15]:
+
+
+messages = [
+    {
+        "role": "system",
+        "content": [
+            {
+                "type": "text",
+                "text": "You are an assistant that transcribes speech accurately.",
+            }
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "audio": test_audio['audio']['array']},
+            {"type": "text", "text": "Please transcribe this audio."}
+        ]
+    }
+]
+
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[16]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:
+
+# In[17]:
+
+
+if False:
+    from unsloth import FastModel
+    model, processor = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # The LoRA folder saved after training
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = processor.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(processor, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[18]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", processor)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[19]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", processor,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!
+
+# In[20]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        processor,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[21]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        processor,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported `.gguf` file (with the settings above, the `Q8_0` export of `gemma_4_finetune`) in llama.cpp.
+# 
+# And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM news, or want to join projects, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,556 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E2B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 1024, # Choose any for long context!
+    load_in_4bit = False,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **tokenizer.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        temperature = 1.0, top_p = 0.95, top_k = 64,
+        streamer = TextStreamer(tokenizer, skip_prompt = True)
+    )
+
+
+# # Gemma 4 can see images!
+# 
+# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
+
+# In[5]:
+
+
+sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "Which films does this animal feature in?" }
+    ]
+}]
+# You might have to wait 1 minute for Unsloth's auto compiler
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# Let's make a poem about sloths!
+
+# In[6]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{ "type" : "text",
+                  "text" : "Write a poem about sloths." }]
+}]
+do_gemma_4_inference(messages)
+
+
+# # Gemma 4 can also hear!
+
+# In[7]:
+
+
+from IPython.display import Audio, display
+Audio("https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3")
+
+
+# In[8]:
+
+
+get_ipython().system('wget -qqq https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3 -O audio.mp3')
+
+
+# In[9]:
+
+
+audio_file = "audio.mp3"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "audio", "audio" : audio_file },
+        { "type": "text",  "text" : "What is this audio about?" }
+    ]
+}]
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# # Let's combine all 3 modalities together!
+
+# In[10]:
+
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "audio", "audio" : audio_file },
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "What is this audio and image about? "\
+                                    "How are they related?" }
+    ]
+}]
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# # Let's finetune Gemma 4!
+# 
+# You can finetune the vision and text parts for now through selection - the audio part can also be finetuned - we're working to make it selectable as well!
+
+# We now add LoRA adapters so we only need to update a small number of parameters!
+
+# In[11]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # Turn off for just text!
+    finetune_language_layers   = True,  # Should leave on!
+    finetune_attention_modules = True,  # Attention good for GRPO
+    finetune_mlp_modules       = True,  # Should leave on always!
+
+    r = 8,           # Larger = higher accuracy, but might overfit
+    lora_alpha = 8,  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+)
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
+# 
+# ```
+# <bos><|turn>user
+# Hello<turn|>
+# <|turn>model
+# Hey there!<turn|>
+# ```
+# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
+
+# In[12]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4",
+)
+
+
+# We get the first 3000 rows of the dataset
+
+# In[13]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
+
+
+# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
+
+# In[14]:
+
+
+from unsloth.chat_templates import standardize_data_formats
+dataset = standardize_data_formats(dataset)
+
+
+# Let's see what row 100 looks like!
+
+# In[15]:
+
+
+dataset[100]
+
+
+# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning: the processor will add this token before training, and the model expects only one.
+
+# In[16]:
+
+
+def formatting_prompts_func(examples):
+    convos = examples["conversations"]
+    texts = [
+        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>')
+        for convo in convos
+    ]
+    return { "text" : texts, }
+
+dataset = dataset.map(formatting_prompts_func, batched = True)
+
+
+# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
+
+# In[17]:
+
+
+dataset[100]["text"]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`.
+
+# In[18]:
+
+
+from trl import SFTTrainer, SFTConfig
+trainer = SFTTrainer(
+    model = model,
+    tokenizer = tokenizer,
+    train_dataset = dataset,
+    eval_dataset = None, # Can set up evaluation!
+    args = SFTConfig(
+        dataset_text_field = "text",
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
+        warmup_steps = 5,
+        # num_train_epochs = 1, # Set this for 1 full training run.
+        max_steps = 60,
+        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
+        logging_steps = 1,
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "linear",
+        seed = 3407,
+        report_to = "none", # Use TrackIO/WandB etc
+    ),
+)
+
+
+# We also use Unsloth's `train_on_responses_only` method to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!
+
+# In[19]:
+
+
+from unsloth.chat_templates import train_on_responses_only
+trainer = train_on_responses_only(
+    trainer,
+    instruction_part = "<|turn>user\n",
+    response_part = "<|turn>model\n",
+)
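+
+
+# Conceptually, training on responses only means every token outside a model turn gets the label `-100`, which PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists of token ids. It is not Unsloth's implementation: `mask_non_response_tokens` and the toy ids are made up for illustration.
+
+# In[ ]:
+
+
+IGNORE_INDEX = -100  # default `ignore_index` of PyTorch's cross-entropy loss
+
+def mask_non_response_tokens(input_ids, instruction_ids, response_ids):
+    """Keep labels only for tokens after a response marker, until the next instruction marker."""
+    if not instruction_ids or not response_ids:
+        raise ValueError("marker id lists must be non-empty")
+    labels = [IGNORE_INDEX] * len(input_ids)
+    in_response = False
+    i = 0
+    while i < len(input_ids):
+        if input_ids[i : i + len(response_ids)] == response_ids:
+            in_response = True            # the marker itself stays masked
+            i += len(response_ids)
+        elif input_ids[i : i + len(instruction_ids)] == instruction_ids:
+            in_response = False
+            i += len(instruction_ids)
+        else:
+            if in_response:
+                labels[i] = input_ids[i]  # this token contributes to the loss
+            i += 1
+    return labels
+
+# Toy ids: [10, 11] stands in for the user-turn marker, [20, 21] for the model-turn marker.
+toy_ids = [1, 10, 11, 5, 6, 20, 21, 7, 8, 10, 11, 9, 20, 21, 3]
+print(mask_non_response_tokens(toy_ids, [10, 11], [20, 21]))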
+
+
+# Let's verify that the instruction part is masked! Let's print the 100th row again. Notice how the sample only has a single `<bos>` as expected!
+
+# In[20]:
+
+
+tokenizer.decode(trainer.train_dataset[100]["input_ids"])
+
+
+# Now let's print the masked out example - you should see only the answer is present:
+
+# In[21]:
+
+
+tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
+
+
+# In[22]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
+
+# In[23]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[24]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[25]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4",
+)
+messages = [{
+    "role": "user",
+    "content": [{
+        "type" : "text",
+        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
+    }]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+outputs = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+)
+tokenizer.batch_decode(outputs)
+
+
+# You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!
+
+# In[26]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[27]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
+
+# In[28]:
+
+
+if False:
+    from unsloth import FastModel
+    model, tokenizer = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[29]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[30]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# We now support saving to `GGUF` / `llama.cpp` natively for all models! For now, you can convert to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4bit will come later!
+
+# In[31]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[32]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported `.gguf` file in llama.cpp. With the settings above this is the `Q8_0` export; a `Q4_K_M` file is not produced yet.
+# 
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[ ]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[ ]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+
+# In[ ]:
+
+
+from unsloth import FastVisionModel # FastLanguageModel for LLMs
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastVisionModel.from_pretrained(
+    "unsloth/gemma-4-E2B-it",
+    load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
+    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
+)
+
+
+# We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.
+# 
+# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
+
+# In[ ]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    finetune_vision_layers     = True, # False if not finetuning vision layers
+    finetune_language_layers   = True, # False if not finetuning language layers
+    finetune_attention_modules = True, # False if not finetuning attention layers
+    finetune_mlp_modules       = True, # False if not finetuning MLP layers
+
+    r = 32,                           # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 32,                  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,               # We support rank stabilized LoRA
+    loftq_config = None,               # And LoftQ
+    target_modules = "all-linear",    # Optional now! Can specify a list if needed
+)
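+
+
+# As a rough illustration of why LoRA trains so few parameters, the sketch below counts the weights added by one adapter and computes the scaling factor applied to the low-rank update. The helper names and the 2048 x 2048 layer shape are hypothetical and are not taken from Gemma 4; the `alpha / r` and `alpha / sqrt(r)` formulas are the standard LoRA and rank-stabilized LoRA scalings.
+# 
```python
def lora_trainable_params(d_in, d_out, r):
    """Weights added by one LoRA adapter: A is (r x d_in) and B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r, use_rslora=False):
    """Factor multiplied onto the low-rank update B @ A."""
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r

d_in, d_out = 2048, 2048  # hypothetical layer shape, not a Gemma 4 dimension
r, lora_alpha = 32, 32    # the values used in the cell above

full = d_in * d_out
added = lora_trainable_params(d_in, d_out, r)
print(f"full layer: {full:,} weights, LoRA adapter: {added:,} ({added / full:.2%})")
print("scaling:", lora_scaling(lora_alpha, r))
```
+# 
+# With `lora_alpha == r` the standard scaling is 1, so the adapter output is added to the frozen layer's output unscaled.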
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
+# 
+# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
+
+# In[ ]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
+
+
+# Let's take an overview of the dataset. We'll examine the third image (index 2) and its corresponding caption.
+
+# In[ ]:
+
+
+dataset
+
+
+# In[ ]:
+
+
+dataset[2]["image"]
+
+
+# In[ ]:
+
+
+dataset[2]["text"]
+
+
+# We can also render LaTeX directly in the browser!
+
+# In[ ]:
+
+
+from IPython.display import display, Math, Latex
+
+latex = dataset[3]["text"]
+display(Math(latex))
+
+
+# To format the dataset, all vision fine-tuning tasks should follow this format:
+# 
+# ```python
+# [
+#     {
+#         "role": "user",
+#         "content": [
+#             {"type": "text", "text": instruction},
+#             {"type": "image", "image": sample["image"]},
+#         ],
+#     },
+#     {
+#         "role": "assistant",
+#         "content": [
+#             {"type": "text", "text": sample["text"]},
+#         ],
+#     },
+# ]
+# ```
+
+# In[ ]:
+
+
+instruction = "Write the LaTeX representation for this image."
+
+def convert_to_conversation(sample):
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": instruction},
+                {"type": "image", "image": sample["image"]},
+            ],
+        },
+        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
+    ]
+    return {"messages": conversation}
+pass
+
+
+# Let's convert the dataset into the "correct" format for finetuning:
+
+# In[ ]:
+
+
+converted_dataset = [convert_to_conversation(sample) for sample in dataset]
+
+
+# The first example is now structured like below:
+
+# In[ ]:
+
+
+converted_dataset[0]
+
+
+# Let's take the Gemma 4 instruction chat template and apply it to our model:
+
+# In[ ]:
+
+
+from unsloth import get_chat_template
+
+processor = get_chat_template(
+    processor,
+    "gemma-4"
+)
+
+
+# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
+
+# In[ ]:
+
+
+image = dataset[2]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# You can see it's absolutely terrible! It doesn't follow instructions at all
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and remove the step limit with `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
+# 
+# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
+
+# In[ ]:
+
+
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = converted_dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4,
+        max_grad_norm = 0.3,
+        warmup_ratio = 0.03,
+        max_steps = 60,
+        # num_train_epochs = 2, # Set this instead of max_steps for full training runs
+        learning_rate = 2e-4,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none", # For Weights and Biases or others
+
+        # You MUST put the below items for vision finetuning:
+        remove_unused_columns = False,
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 2048,
+    )
+)
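+
+
+# To get a feel for how much data the short run above actually sees, this small sketch computes the effective batch size and the number of samples consumed. The helper `training_budget` and the dataset size of 10,000 are hypothetical; in the notebook you would use `len(converted_dataset)` instead.
+# 
```python
def training_budget(per_device_batch, grad_accum, max_steps, num_devices=1):
    """Samples consumed per optimizer step and over the whole run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return effective_batch, effective_batch * max_steps

# Values taken from the SFTConfig above.
effective_batch, samples_seen = training_budget(
    per_device_batch=1, grad_accum=4, max_steps=60,
)
print("effective batch size:", effective_batch)
print("samples seen:", samples_seen)

dataset_size = 10_000  # hypothetical; use len(converted_dataset) in the notebook
print(f"{samples_seen / dataset_size:.1%} of one epoch")
```
+# 
+# Gradient accumulation trades wall-clock time for memory: the optimizer step behaves like a batch of 4 while only one sample is held on the GPU at a time.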
+
+
+# In[ ]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# In[ ]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[ ]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model! You can modify the instruction and input—just leave the output blank.
+# 
+# We'll use the best hyperparameters for inference on Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
+
+# In[ ]:
+
+
+image = dataset[10]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[ ]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, change `if False` to `if True`:
+
+# In[ ]:
+
+
+if False:
+    from unsloth import FastVisionModel
+
+    model, processor = FastVisionModel.from_pretrained(
+        model_name = "gemma_4_lora",  # YOUR MODEL YOU USED FOR TRAINING
+        load_in_4bit = True,  # Set to False for 16bit LoRA
+    )
+
+sample = dataset[1]
+image = sample["image"].convert("RGB")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": instruction,  # Prompt with the instruction, not the ground-truth LaTeX
+            },
+            {
+                "type": "image",
+            },
+        ],
+    },
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
+_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                   use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[ ]:
+
+
+# Select ONLY 1 to save! (Both not needed!)
+
+# Save locally to 16bit
+if False: model.save_pretrained_merged("unsloth_finetune", processor,)
+
+# To export and save to your Hugging Face account
+if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
+
+
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,911 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# # ### Installation
+# 
+# # In[ ]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[ ]:
+# 
+# 
+# #@title Colab Extra Install { display-mode: "form" }
+# get_ipython().run_line_magic('%capture', '')
+# import os
+# get_ipython().system('pip install --upgrade -qqq uv')
+# if "COLAB_" not in "".join(os.environ.keys()):
+#     # If you're not in Colab, just use pip install!
+#     get_ipython().system('pip install unsloth vllm')
+# else:
+#     try: import numpy, PIL; _numpy = f'numpy=={numpy.__version__}'; _pil = f'pillow=={PIL.__version__}'
+#     except: _numpy = "numpy"; _pil = "pillow"
+#     try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
+#     except: is_t4 = False
+#     _vllm, _triton = ('vllm==0.9.2', 'triton==3.2.0') if is_t4 else ('vllm==0.15.1', 'triton')
+#     get_ipython().system('uv pip install -qqq --upgrade {_vllm} {_numpy} {_pil} torchvision bitsandbytes xformers unsloth')
+#     get_ipython().system('uv pip install -qqq {_triton}')
+# get_ipython().system('uv pip install transformers==4.56.2')
+# get_ipython().system('uv pip install --no-deps trl==0.22.2')
+# 
+# 
+# # ### Unsloth
+
+# # Goal: Make faster kernels with Reinforcement Learning
+# 
+# Our goal is to make a faster matrix multiplication kernel by doing RL on Gemma 4 with Unsloth.
+# 
+# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Matrix_multiplication_qtl1.svg/500px-Matrix_multiplication_qtl1.svg.png" height=200 />
+# 
+# You will learn how to:
+# 1. Counteract **reward hacking** such as cheating, caching and laziness.
+# 2. Measure the timing and correctness of kernels under time limits.
+# 3. Write good **reward functions**.
+# 4. Do RL seriously to produce optimized kernels.
+
+# In[ ]:
+
+
+from unsloth import FastVisionModel
+import torch
+max_seq_length = 4096 # Can increase for longer reasoning traces
+lora_rank = 32 # Larger rank = smarter, but slower
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastVisionModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E2B-it",
+    max_seq_length = max_seq_length,
+    load_in_4bit = False, # False for LoRA 16bit
+    fast_inference = False, # Set to True to enable vLLM fast inference
+)
+
+
+# We now add some small amount of LoRA weights to Gemma 4 so we only need to train those, instead of training on the full model.
+
+# In[ ]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
+    target_modules = [
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+    ],
+    lora_alpha = lora_rank*2, # *2 speeds up training
+    use_gradient_checkpointing = "unsloth", # Reduces memory usage
+    random_state = 3407,
+)
+
+
+# # Optimized matrix multiplication
+# 
+# NumPy has optimized matrix multiplication kernels for CPUs via BLAS. For GPUs, one can use the CUDA-accelerated cuBLAS kernels, which PyTorch calls under the hood.
+# 
+# To generate some random matrices to do matrix multiplication, we can do the below:
+
+# In[ ]:
+
+
+import numpy as np
+def generate_random_matrices(seed = 3407, n = 256):
+    random_state = np.random.RandomState(seed)
+    n, k, m = random_state.randint(1, n+1, size = 3)
+    # Draw from the seeded generator so the matrices are reproducible for a given seed
+    A = random_state.uniform(-10, 10, size = (n, k))
+    B = random_state.uniform(-10, 10, size = (k, m))
+    return A, A.tolist(), B, B.tolist()
+
+
+# We shall generate a small matrix, and see the matrix multiplied output
+
+# In[ ]:
+
+
+A, A_list, B, B_list = generate_random_matrices(seed = 42, n = 5)
+print(A)
+print(B)
+print(np.matmul(A, B))
+
+
+# We can call an LLM to generate a simple matrix multiply kernel in Python only, and we can calculate the differences between the actual result and the kernel's result
+
+# In[ ]:
+
+
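+def calculate_difference(pred, real):
+    # A missing prediction gets a fixed, large error
+    if pred is None: return 5, 5
+    assert real is not None
+    import numpy as np
+    try:
+        difference = pred - real
+    except Exception:
+        # Shape mismatch or non-numeric output
+        return 5, 5
+    # Use absolute differences so large negative deviations are not hidden
+    amax_error = float(np.amax(np.abs(difference)))
+    mse_error  = float(np.mean(np.square(difference)))
+    return amax_error, mse_error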
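+
+
+# For readers who want to see the two error metrics without NumPy, here is a pure-Python sketch on nested lists. The helper `error_metrics` is a hypothetical name; it takes the maximum over absolute differences so that a large negative deviation is not hidden.
+# 
```python
def error_metrics(pred, real):
    """Return (max absolute error, mean squared error) for two nested lists."""
    if len(pred) != len(real) or any(len(p) != len(r) for p, r in zip(pred, real)):
        raise ValueError("shape mismatch")
    # Flatten the element-wise differences row by row.
    diffs = [p - r for prow, rrow in zip(pred, real) for p, r in zip(prow, rrow)]
    max_abs = max(abs(d) for d in diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return max_abs, mse

pred = [[1.0, 2.0], [3.0, 4.5]]
real = [[1.0, 2.0], [3.0, 4.0]]
print(error_metrics(pred, real))

# A prediction that is too small everywhere still registers as a large error.
print(error_metrics([[0.0]], [[2.0]]))
```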
+
+
+# In[ ]:
+
+
+# Kernel generated by GPT-5
+def matmul(A, B):
+    z, s = zip, sum
+    Bt = list(z(*B))
+    return [[s(a*b for a, b in z(row, col)) for col in Bt] for row in A]
+
+
+# We see the error below is very small, so that's good!
+
+# In[ ]:
+
+
+prediction = matmul(A_list, B_list)
+calculate_difference(prediction, np.matmul(A, B))
+
+
+# # Countering Reward Hacking
+# 
+# The ultimate goal of RL is to maximize some reward (say speed, revenue, some metric).
+# 
+# But RL can **cheat**. When the RL algorithm learns a trick or exploits something to increase the reward without actually doing the task at hand, this is called "Reward Hacking".
+# 
+# Some good examples are in https://en.wikipedia.org/wiki/Reward_hacking
+# 
+# For matrix multiplication kernels, we might see the following issues:
+# 
+# * Laziness: RL learns to use NumPy, Torch or other libraries, which call optimized kernels.
+# * Caching: RL learns to cache the result of the output.
+# * Cheating: RL learns to find the actual output by inspecting Python global variables.
+# * Timer tampering: RL learns to edit the timing function so that it reports zero elapsed time.
+# 
+# And possibly more. We shall try to address each!
+
+# # Countering Reward Hacking 1: Stop laziness
+# We can stop the RL algorithm from calling optimized code by inspecting whether the generated code imports non-standard Python libraries. We used GPT-5 to help generate this check `check_only_stdlib_imports`:
+
+# In[ ]:
+
+
+#@title (Collapsible code)
+import ast
+import sys
+import sysconfig
+from pathlib import Path
+
+def _stdlib_names():
+    """
+    Build a set of canonical stdlib top-level module/package names.
+    Uses sys.stdlib_module_names when available (3.10+), with a
+    filesystem fallback for older versions/edge cases.
+    """
+    names = {m.lower() for m in getattr(sys, "stdlib_module_names", set())}
+    names |= {m.lower() for m in sys.builtin_module_names}
+    names.add("__future__")  # special-case
+
+    # Fallback/augmentation: scan the stdlib directory
+    try:
+        stdlib_dir = Path(sysconfig.get_path("stdlib"))
+        if stdlib_dir.exists():
+            for p in stdlib_dir.iterdir():
+                if p.name == "site-packages":
+                    continue
+                if p.suffix == ".py":
+                    names.add(p.stem.lower())
+                elif p.is_dir() and (p / "__init__.py").exists():
+                    names.add(p.name.lower())
+    except Exception:
+        # conservative fallback; the names set above will still work well
+        pass
+
+    return names
+
+_STDLIB_SET = _stdlib_names()
+
+def check_only_stdlib_imports(code: str):
+    """
+    Return (ok: bool, details: dict)
+
+    ok == True  -> all absolute imports are from the stdlib.
+    ok == False -> details['non_stdlib'] lists offending top-level modules.
+
+    details includes:
+      - stdlib: sorted list of stdlib imports found
+      - non_stdlib: sorted list of non-stdlib imports found
+      - relative_imports: count of relative imports (always allowed here)
+    """
+    try:
+        tree = ast.parse(code)
+    except SyntaxError as e:
+        return False, {
+            "error": f"SyntaxError: {e}",
+            "stdlib": [],
+            "non_stdlib": [],
+            "relative_imports": 0,
+        }
+
+    abs_imports = set()
+    relative_count = 0
+
+    class Visitor(ast.NodeVisitor):
+        def visit_Import(self, node: ast.Import):
+            for alias in node.names:
+                abs_imports.add(alias.name.split(".")[0])
+        def visit_ImportFrom(self, node: ast.ImportFrom):
+            nonlocal relative_count
+            if (node.level or 0) > 0:
+                # relative import
+                relative_count += 1
+            else:
+                if node.module:
+                    abs_imports.add(node.module.split(".")[0])
+
+    Visitor().visit(tree)
+
+    stdlib_found = sorted(m for m in abs_imports if m.lower() in _STDLIB_SET)
+    non_stdlib = sorted(m for m in abs_imports if m.lower() not in _STDLIB_SET)
+
+    return len(non_stdlib) == 0, {
+        "stdlib": stdlib_found,
+        "non_stdlib": non_stdlib,
+        "relative_imports": relative_count,
+    }
+
+
+# For example, let's call `check_only_stdlib_imports` on a random piece of matrix multiplication code generated by GPT-5:
+
+# In[ ]:
+
+
+sample = """
+def matmul(A, B):
+    import numpy as np
+    from torch import matmul
+    z, s = zip, sum
+    Bt = list(z(*B))
+    return [[s(a*b for a, b in z(row, col)) for col in Bt] for row in A]
+"""
+ok, info = check_only_stdlib_imports(sample)
+print("Only stdlib imports?", ok)
+print(info)
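+
+
+# A static scan of `import` statements does not see imports made at runtime, for example through `__import__("numpy")`. Below is a minimal sketch of an extra heuristic, assuming a hypothetical helper name `find_dynamic_imports` that is not part of this notebook. It only flags calls by name, so treat it as a hint rather than a sandbox.
+
```python
import ast

def find_dynamic_imports(code: str):
    """Return sorted names of dynamic-import style calls found in `code`.

    Hypothetical helper: flags calls to `__import__`, `import_module`,
    `exec` and `eval` by name only, so it is a heuristic, not a sandbox.
    """
    suspicious = {"__import__", "import_module", "exec", "eval"}
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            func = node.func
            # Plain call such as __import__("numpy")
            if isinstance(func, ast.Name) and func.id in suspicious:
                found.add(func.id)
            # Attribute call such as importlib.import_module("numpy")
            elif isinstance(func, ast.Attribute) and func.attr in suspicious:
                found.add(func.attr)
    return sorted(found)

sneaky = '''
def matmul(A, B):
    np = __import__("numpy")
    return np.matmul(A, B).tolist()
'''
print(find_dynamic_imports(sneaky))
```
+
+# A check like this could be combined with `check_only_stdlib_imports` inside a reward function, penalizing any completion for which the returned list is not empty.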
+
+
+# # Countering Reward Hacking 2: Stop cheating
+# We can stop the RL algorithm from using global or cached variables by restricting its `locals` and `globals`.
+# 
+# We are also going to use `exec` to create the function, so we have to save the output to an empty dict.
+# 
+# We also disallow global variable access.
+
+# In[ ]:
+
+
+output_function = {}
+exec(sample, {}, output_function)
+output_function["matmul"]
+
+
+# We also disallow global variable access via `types.FunctionType(f.__code__, {})`
+
+# In[ ]:
+
+
+import types
+output_function["matmul"] = types.FunctionType(output_function["matmul"].__code__, {})
+
+def import_numpy():
+    np.matmul
+    print("Success")
+
+import_numpy()
+import_numpy = types.FunctionType(import_numpy.__code__, {})
+try:
+    import_numpy()
+except Exception as e:
+    print(str(e))
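+
+
+# The same idea in a standalone form: rebuilding a function from its code object with an empty globals dict makes every global lookup fail, while arguments and local variables keep working. The names `SECRET`, `peek` and `local_only` are made up for this sketch.
+
```python
import types

SECRET = 42  # hypothetical global that a generated function might try to read

def peek():
    return SECRET

def local_only(x):
    return x * 2

# Rebuild both functions with an empty globals dict
locked_peek = types.FunctionType(peek.__code__, {})
locked_local = types.FunctionType(local_only.__code__, {})

try:
    locked_peek()
    leaked = True
except NameError as e:
    leaked = False
    print("Blocked:", e)

# Functions that only touch their arguments are unaffected
print(locked_local(21))
```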
+
+
+# In[ ]:
+
+
+def create_locked_down_function(function):
+    output_function = {}
+    exec(function, {}, output_function)
+    new_matmul = output_function["matmul"]
+    new_matmul = types.FunctionType(new_matmul.__code__, {})
+    return new_matmul
+
+
+# # Countering Reward Hacking 3: Stop caching
+# We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix. We also have to benchmark carefully with multiple loops and turns.
+# 
+# We also add a **timer** so that a generated function cannot get stuck in an endless loop.
+
+# In[ ]:
+
+
+import os, gc, time, statistics
+import signal
+from contextlib import contextmanager
+class TimeoutError(Exception): pass
+
+@contextmanager
+def time_limit(seconds):
+    def _handler(signum, frame):
+        raise TimeoutError(f"Timed out after {seconds}s")
+    old = signal.signal(signal.SIGALRM, _handler)
+    signal.setitimer(signal.ITIMER_REAL, seconds)
+    try:
+        yield
+    finally:
+        signal.setitimer(signal.ITIMER_REAL, 0.0)
+        signal.signal(signal.SIGALRM, old)
+
+class Benchmarker:
+    def __init__(self, trials = 3, loops = 1, timeout = 30):
+        self.buffer = np.zeros(2 * 1024 * 1024 * 1024, dtype = np.uint8)
+        self.trials = trials
+        self.loops = loops
+        assert timeout > 0 # Cannot be 0 since it won't work!
+        self.timeout = timeout
+    def thrash(self):
+        # Edit the buffer to wipe cache lines
+        self.buffer ^= 1
+        return int(self.buffer[::4096].sum())
+
+    def benchmark(self, function, arguments):
+        assert len(arguments) == self.loops
+        samples = []
+        exceptions = []
+        timed_out = 0
+        for _ in range(self.trials):
+            gc.collect(); gc.disable(); self.thrash()
+            t_start = time.perf_counter_ns()
+            for i in range(self.loops):
+                try:
+                    with time_limit(self.timeout):
+                        function(*arguments[i])
+                except TimeoutError as e:
+                    timed_out += 1
+                except Exception as e:
+                    exceptions.append(str(e))
+            t_end = time.perf_counter_ns()
+            gc.enable()
+            samples.append((t_end - t_start) // max(1, self.loops))
+        return {
+            "median_ns": int(statistics.median(samples)),
+            "mean_ns": int(statistics.fmean(samples)),
+            "stdev_ns": int(statistics.pstdev(samples) if len(samples) > 1 else 0),
+            "exceptions" : exceptions,
+            "timeouts" : timed_out,
+        }
+
+
+# For example, we take the matmul kernel from above and benchmark it with a 10 second timeout:
+
+# In[ ]:
+
+
+A, A_list, B, B_list = generate_random_matrices(seed = 0, n = 256)
+Benchmarker(trials = 1, timeout = 10).benchmark(output_function["matmul"], [(A_list, B_list)])
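+
+
+# To see the timeout on its own, here is a minimal sketch that wraps a call in the same `SIGALRM` pattern. It assumes a Unix system and the main thread, since `signal.setitimer` is not available on Windows. The exception is named `BenchmarkTimeout` here (a hypothetical name) to avoid shadowing Python's built-in `TimeoutError`.
+
```python
import signal, time
from contextlib import contextmanager

class BenchmarkTimeout(Exception):
    pass

@contextmanager
def time_limit(seconds):
    def _handler(signum, frame):
        raise BenchmarkTimeout(f"Timed out after {seconds}s")
    old = signal.signal(signal.SIGALRM, _handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always cancel the timer and restore the previous handler
        signal.setitimer(signal.ITIMER_REAL, 0.0)
        signal.signal(signal.SIGALRM, old)

def run_with_limit(function, seconds):
    """Return ("ok", result) or ("timeout", None)."""
    try:
        with time_limit(seconds):
            return "ok", function()
    except BenchmarkTimeout:
        return "timeout", None

print(run_with_limit(lambda: sum(range(1000)), 1.0))
print(run_with_limit(lambda: time.sleep(5), 0.2))
```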
+
+
+# # Data & RL task setup
+# 
+# We now have to create a prompt to the model for which it will do some task. For our matrix multiply example, we use the below:
+
+# In[ ]:
+
+
+prompt = """
+Create a new fast matrix multiplication function using only native Python code.
+You are given a list of list of numbers.
+Output your new function in backticks using the format below:
+```python
+def matmul(A, B):
+    return ...
+```
+""".strip()
+print(prompt)
+
+
+# First, let's prompt Gemma 4 without RL and see how it goes:
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+print("=" * 50)
+print("BASE MODEL OUTPUT (before RL training):")
+print("=" * 50)
+
+inputs = tokenizer(
+    text = text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+text_streamer = TextStreamer(tokenizer, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# # Reward functions
+# 
+# We now design the `extract_function` function which simply extracts the function wrapped in 3 backticks.
+# 
+# And 4 reward functions:
+# 
+# 1. `function_works` which rewards the model if the strategy is a valid Python function.
+# 2. `no_cheating` which checks if the function imported other modules, and if it did, we penalize it.
+# 3. `correctness_check` which checks if the kernel was correct or wrong - it shouldn't generate gibberish!
+# 4. `speed_check` checks the performance relative to Numpy matmul directly.
+
+# In[ ]:
+
+
+def extract_function(text):
+    if text.count("```") >= 2:
+        first = text.find("```") + 3
+        second = text.find("```", first)
+        fx = text[first : second].strip()
+        fx = fx.removeprefix("python\n")
+        fx = fx[fx.find("def"):]
+        if fx.startswith("def matmul(A, B):"): return fx
+    return None
+print(extract_function(prompt))
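+
+
+# To make the behaviour concrete, the sketch below mirrors the extraction logic under a hypothetical name, `extract_matmul`, and runs it on three made-up responses. The fence string is built as `"`" * 3` purely so the example can be nested inside other documents.
+
```python
FENCE = "`" * 3  # three backticks, built indirectly so this example nests cleanly

def extract_matmul(text):
    """Mirror of the extraction logic above, under a hypothetical name."""
    if text.count(FENCE) >= 2:
        first = text.find(FENCE) + 3
        second = text.find(FENCE, first)
        fx = text[first:second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):]
        if fx.startswith("def matmul(A, B):"):
            return fx
    return None

good = f"Here you go:\n{FENCE}python\ndef matmul(A, B):\n    return A\n{FENCE}\nDone."
wrong_name = f"{FENCE}python\ndef multiply(A, B):\n    return A\n{FENCE}"
no_fence = "def matmul(A, B):\n    return A"

print(extract_matmul(good))        # the function source
print(extract_matmul(wrong_name))  # None: signature does not match
print(extract_matmul(no_fence))    # None: nothing wrapped in backticks
```
+
+# Only the first fenced block is considered, and the signature must match `def matmul(A, B):` exactly, so a response with different argument names also yields `None`.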
+
+
+# Below is our `function_works` reward function which uses Python's `exec` but guarded by not allowing leakage of local and global variables. We can also use `check_only_stdlib_imports` first to check if there are errors before even executing the function:
+
+# In[ ]:
+
+
+ok, info = check_only_stdlib_imports("def a")
+ok, info
+
+
+# In[ ]:
+
+
+def function_works(completions, **kwargs):
+    scores = []
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        print(function)
+        if function is not None:
+            ok, info = check_only_stdlib_imports(function)
+        if function is None or "error" in info:
+            score = -2.0
+        else:
+            try:
+                new_matmul = create_locked_down_function(function)
+                score = 1.0
+            except:
+                score = -0.5
+        scores.append(score)
+    return scores
+
+
+# `no_cheating` checks if the function cheated since it might have imported Numpy or Torch optimized code.
+
+# In[ ]:
+
+
+def no_cheating(completions, **kwargs):
+    scores = []
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if function is not None:
+            ok, info = check_only_stdlib_imports(function)
+        else:
+            ok = False
+        scores.append(1.0 if ok else -20.0) # Penalize heavily!
+    return scores
+
+
+# Next, `correctness_check` checks if the kernel was correct. We want to penalize if the absolute error is larger than 1, and if the mean squared error is somewhat bigger than machine epsilon.
+# 
+# We have to execute the code now!
+
+# In[ ]:
+
+
+np.finfo(np.float64).eps
+
+
+# In[ ]:
+
+
+def correctness_check(completions, **kwargs):
+    scores = []
+    # Generate some random matrices of size less than 128
+    A, A_list, B, B_list = generate_random_matrices(seed = np.random.randint(10000), n = 128)
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if function is not None:
+            ok, info = check_only_stdlib_imports(function)
+        if function is None or "error" in info:
+            scores.append(0)
+            continue
+        try:
+            new_matmul = create_locked_down_function(function)
+        except:
+            scores.append(0)
+            continue
+        try:
+            pred = new_matmul(A_list.copy(), B_list.copy())
+        except:
+            # Failed!
+            scores.append(-2.0)
+            continue
+        true = np.matmul(A, B)
+        amax_error, mse_error = calculate_difference(pred, true)
+
+        # Check correctness and score!
+        machine_epsilon = 100*np.finfo(np.float64).eps
+        if   amax_error >= 3:   score = -3.0
+        elif amax_error >= 2:   score = -2.5
+        elif amax_error >= 1:   score = -2.0
+        elif amax_error >= 0.5: score = -1.0
+        elif amax_error >= 100*machine_epsilon: score = 0.0
+        elif amax_error >= machine_epsilon: score = 1.0
+        else: score = 3.0
+
+        if   mse_error >= 3:   score += -3.0
+        elif mse_error >= 2:   score += -2.5
+        elif mse_error >= 1:   score += -2.0
+        elif mse_error >= 0.5: score += -1.0
+        elif mse_error >= 100*machine_epsilon: score += 0.0
+        elif mse_error >= machine_epsilon: score += 1.0
+        else: score += 3.0
+        scores.append(score)
+    return scores
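+
+
+# Both error metrics go through the same ladder of thresholds, so the scoring can be read as one small function applied twice. Below is a minimal standalone sketch of that ladder; `tier_score` and `combined_score` are hypothetical names, and `sys.float_info.epsilon` stands in for `np.finfo(np.float64).eps` (the two values are equal for IEEE 754 doubles).
+
```python
import sys

EPS = sys.float_info.epsilon   # same value as np.finfo(np.float64).eps
TOLERANCE = 100 * EPS          # what the cell above calls `machine_epsilon`

def tier_score(error):
    """Map a non-negative error to the tiered score used for each metric."""
    if   error >= 3:               return -3.0
    elif error >= 2:               return -2.5
    elif error >= 1:               return -2.0
    elif error >= 0.5:             return -1.0
    elif error >= 100 * TOLERANCE: return 0.0
    elif error >= TOLERANCE:       return 1.0
    else:                          return 3.0

def combined_score(amax_error, mse_error):
    # The max-abs error and the MSE each contribute one tiered score
    return tier_score(amax_error) + tier_score(mse_error)

print(combined_score(0.0, 0.0))
print(combined_score(0.7, 1e-14))
print(combined_score(5.0, 2.5))
```
+
+# One thing to watch: a NaN compares false against every threshold, so a NaN error would fall through to the top score in a ladder like this. Whether that can happen depends on how `calculate_difference` treats NaNs; an explicit `math.isnan` guard is one way to rule it out.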
+
+
+# Finally our benchmarking function for `speed_check`! We shall limit the timer to 10 seconds and do 3 trials.
+
+# In[ ]:
+
+
+A, A_list, B, B_list = generate_random_matrices(seed = 0, n = 256)
+benchmarker = Benchmarker(trials = 3, timeout = 10)
+numpy_results = benchmarker.benchmark(np.matmul, [(A, B)])
+numpy_results
+
+
+# In[ ]:
+
+
+new_matmul = create_locked_down_function(extract_function(prompt))
+new_results = benchmarker.benchmark(new_matmul, [(A_list, B_list)])
+new_results
+
+
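+# We take the ratio of the two timings and give it a negative sign when the new function is slower than Numpy. If the new function is faster (i.e. the ratio is less than 1), we invert the ratio and keep it positive. In both cases we divide by 100 to keep the reward small.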
+
+# In[ ]:
+
+
+negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
+positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
+reward = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
+reward
+
+
+# In[ ]:
+
+
+new_results["median_ns"] = 3
+numpy_results["median_ns"] = 1000
+negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
+positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
+reward = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
+reward
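+
+
+# The same formula written as one function, including the clipping to [-10, 10] that `speed_check` applies. This is a sketch: `speed_reward` is a hypothetical name, and the `max(1, ...)` guard against a zero timing is an added assumption rather than something the notebook does.
+
```python
def speed_reward(new_ns, ref_ns, scale = 100, clip = 10):
    """Signed speed ratio; `scale` and `clip` follow the constants used in the notebook."""
    new_ns = max(1, new_ns)  # guard against division by zero (an added assumption)
    ref_ns = max(1, ref_ns)
    if new_ns >= ref_ns:
        score = -(new_ns / ref_ns) / scale   # slower or equal: negative
    else:
        score = +(ref_ns / new_ns) / scale   # faster: positive
    return max(-clip, min(clip, score))

print(speed_reward(3, 1000))        # much faster than the reference
print(speed_reward(1000, 1000))     # same speed
print(speed_reward(50_000, 1000))   # 50x slower
print(speed_reward(1, 100_000))     # extreme speedup, clipped
```
+
+# Note the small jump around equal speed: a function exactly as fast as the reference scores -0.01, while one that is marginally faster scores about +0.01.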
+
+
+# In[ ]:
+
+
+import gc
+def speed_check(completions, **kwargs):
+    scores = []
+    # Generate some random matrices of size less than 256
+    A, A_list, B, B_list = generate_random_matrices(seed = np.random.randint(10000), n = 256)
+    numpy_results = benchmarker.benchmark(np.matmul, [(A, B)])
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if function is not None:
+            ok, info = check_only_stdlib_imports(function)
+        if function is None or "error" in info:
+            scores.append(0)
+            continue
+        try:
+            new_matmul = create_locked_down_function(function)
+        except:
+            scores.append(0)
+            continue
+        new_results = benchmarker.benchmark(new_matmul, [(A_list.copy(), B_list.copy())])
+
+        # Get score and clip to -10, 10
+        negative = -(new_results["median_ns"] / numpy_results["median_ns"]) / 100
+        positive = +(numpy_results["median_ns"] / new_results["median_ns"]) / 100
+        score = negative if new_results["median_ns"] >= numpy_results["median_ns"] else positive
+        if score >= 10:  score = 10
+        if score <= -10: score = -10
+        scores.append(score)
+    # Free memory to counteract OOMs
+    gc.collect()
+    torch.cuda.empty_cache()
+    return scores
+
+
+# We create the dataset which includes a replica of our prompt.
+
+# In[ ]:
+
+
+from datasets import Dataset
+dataset = Dataset.from_list([{"prompt" : [{"role": "user", "content": prompt.strip()}], "answer" : 0}]*1000)
+maximum_length = len(tokenizer.apply_chat_template([{"role":"user", "content":prompt.strip()}], add_generation_prompt = True, tokenize = True))
+print(maximum_length)
+dataset[0]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# 
+# Now set up the GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to our docs https://unsloth.ai/docs/ for more info!
+
+# In[ ]:
+
+
+# Leave room for the prompt (plus 1 token safety margin)
+max_completion_length = max_seq_length - (maximum_length + 1)
+
+from trl import GRPOConfig, GRPOTrainer
+training_args = GRPOConfig(
+    temperature = 1.0,
+    top_p = 0.95,
+    top_k = 64,
+    learning_rate = 5e-5,
+    weight_decay = 0.001,
+    warmup_ratio = 0.1,
+    lr_scheduler_type = "linear",
+    optim = "adamw_8bit",
+    logging_steps = 1,
+    per_device_train_batch_size = 1,
+    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
+    num_generations = 2, # Decrease if out of memory
+    max_completion_length = max_completion_length,
+    # num_train_epochs = 1, # Set to 1 for a full training run
+    max_steps = 100,
+    save_steps = 100,
+    report_to = "none", # Can use Weights & Biases, TrackIO
+    output_dir = "outputs",
+    epsilon = 0.2,
+    epsilon_high = 0.28, # one sided
+    delta = 1.5, # two sided
+    loss_type = 'bnpo',
+    mask_truncated_completions = True
+    # For optional training + evaluation
+    # fp16_full_eval = True,
+    # per_device_eval_batch_size = 4,
+    # eval_accumulation_steps = 1,
+    # eval_strategy = "steps",
+    # eval_steps = 1,
+)
+
+
+# And let's run the trainer! While it trains, you'll see a table of rewards like the one below. The goal is to see the `reward` column increase!
+# 
+# You might have to wait 150 to 200 steps for any action, and you'll probably get 0 reward for the first 100 steps, so please be patient! Note that `max_steps = 100` in the config above, so increase it if you want to train for longer.
+# 
+# | Step | Training Loss | reward    | reward_std | completion_length | kl       |
+# |------|---------------|-----------|------------|-------------------|----------|
+# | 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
+# | 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
+# | 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
+
+# In[ ]:
+
+
+# For optional training + evaluation
+# new_dataset = dataset.train_test_split(test_size = 0.01)
+
+trainer = GRPOTrainer(
+    model = model,
+    processing_class = tokenizer,
+    reward_funcs = [
+        function_works,
+        no_cheating,
+        correctness_check,
+        speed_check,
+    ],
+    args = training_args,
+    train_dataset = dataset,
+
+    # For optional training + evaluation
+    # train_dataset = new_dataset["train"],
+    # eval_dataset = new_dataset["test"],
+)
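+
+
+# Conceptually, GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below illustrates that idea only and is not TRL's implementation: it assumes the four reward functions are summed with equal weight, uses a population standard deviation, and picks an arbitrary small `eps`. The reward values are made up.
+
```python
import statistics

def group_advantages(reward_rows, eps = 1e-4):
    """reward_rows[i] holds one completion's rewards, one per reward function."""
    totals = [sum(row) for row in reward_rows]   # unweighted sum per completion
    mean = statistics.fmean(totals)
    std = statistics.pstdev(totals)              # population std (an assumption)
    return [(t - mean) / (std + eps) for t in totals]

# Two completions for the same prompt (num_generations = 2):
# [function_works, no_cheating, correctness_check, speed_check]
group = [
    [ 1.0,   1.0, 6.0, -0.5],   # valid, stdlib-only, correct, slower than Numpy
    [-2.0, -20.0, 0.0,  0.0],   # no function could be extracted
]
print(group_advantages(group))

# If every completion in a group gets the same reward, there is no signal
print(group_advantages([[1.0, 1.0], [1.0, 1.0]]))
```
+
+# This is why a group whose completions all score the same (a `reward_std` of 0 in the training table) contributes nothing to learning, and why a larger `num_generations` can help if memory allows.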
+
+
+# And let's train the model!
+# 
+# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
+
+# In[ ]:
+
+
+trainer.train()
+
+
+# Now that we have trained the LoRA with GRPO, we first save it!
+
+# In[ ]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+
+
+# Verify LoRA is actually trained!
+
+# In[ ]:
+
+
+from safetensors import safe_open
+
+# The path must match the directory passed to `model.save_pretrained` above
+with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
+    # Verify that no LoRA tensor (A or B) is entirely zero
+    for key in f.keys():
+        tensor = f.get_tensor(key)
+        n_zeros = (tensor == 0).sum().item()
+        assert n_zeros != tensor.numel(), f"{key} is all zeros"
+
+
+# <a name="Inference"></a>
+# # Inference
+# Now let's try the model we just trained!
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+
+_ = model.generate(
+    **tokenizer(images = None, text = text, return_tensors = "pt").to("cuda"),
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    max_new_tokens = 1024,
+    streamer = TextStreamer(tokenizer, skip_prompt = False),
+)
+
+
+# <a name="Save"></a>
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[ ]:
+
+
+# Merge to 16bit
+if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
+
+# Merge to 4bit
+if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
+
+# Just LoRA adapters
+if False:
+    model.save_pretrained("gemma_4_lora")
+    tokenizer.save_pretrained("gemma_4_lora")
+if False:
+    model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+    tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+
+
+# ### GGUF / llama.cpp Conversion
+# We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are allowed too. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
+# 
+# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
+# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
+# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
+# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
+# 
+# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+
+# In[ ]:
+
+
+# Save to 8bit Q8_0
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
+# Remember to go to https://huggingface.co/settings/tokens for a token!
+# And change HF_USERNAME to your username!
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
+
+# Save to 16bit GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
+
+# Save to q4_k_m GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
+
+# Save to multiple GGUF options - much faster if you want multiple!
+if False:
+    model.push_to_hub_gguf(
+        "HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
+        tokenizer,
+        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
+# 
+# And we're done! If you have any questions about Unsloth, find any bugs, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,913 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# # Goal: Make Gemma 4 play games with Reinforcement Learning
+# 
+# Our goal is to make Gemma 4 play the 2048 game using reinforcement learning, specifically a variant of it called [GRPO](https://arxiv.org/abs/2501.12948).
+# 
+# We want the model to devise a strategy to play 2048, and we will run this strategy until we win or lose. We then reward the model if it created a good strategy (winning the game), and we'll penalize it (negative reward) if the strategy was a bad one.
+# 
+# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/2048_win.png/500px-2048_win.png" height=300 />
+
+# # Installation
+# We'll be using [Unsloth](https://github.com/unslothai/unsloth) to do RL on Gemma 4. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster!
+
+# In[ ]:
+
+
+get_ipython().run_cell_magic('capture', '', 'import os, importlib.util\n!pip install --upgrade -qqq uv\nif importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):\n    try: import numpy, PIL; _numpy = f"numpy=={numpy.__version__}"; _pil = f"pillow=={PIL.__version__}"\n    except: _numpy = "numpy"; _pil = "pillow"\n    # Gemma 4 requires transformers >= 5.5.0 — do NOT pin to 4.x here\n    !uv pip install -qqq \\\n        "torch>=2.8.0" "triton>=3.4.0" {_numpy} {_pil} torchvision bitsandbytes \\\n        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \\\n        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \\\n        git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\nelif importlib.util.find_spec("unsloth") is None:\n    !uv pip install -qqq unsloth\n# Gemma 4 requires transformers >= 5.5.0\n!uv pip install --upgrade --no-deps "transformers>=5.5.0" tokenizers "trl>=0.28.0" unsloth unsloth_zoo\n')
+
+
+# In[ ]:
+
+
+get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+
+
+# ### Unsloth
+
+# In[ ]:
+
+
+from unsloth import FastVisionModel
+import torch
+max_seq_length = 4096 # Can increase for longer reasoning traces
+lora_rank = 32 # Larger rank = smarter, but slower
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastVisionModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E2B-it",
+    max_seq_length = max_seq_length,
+    load_in_4bit = False, # False for LoRA 16bit
+    fast_inference = False, # Set to True to enable vLLM fast inference
+)
+
+
+# To do efficient RL, we will use [LoRA](https://arxiv.org/abs/2106.09685), which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy.
+
+# In[ ]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
+    target_modules = [
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+    ],
+    lora_alpha = lora_rank*2, # *2 speeds up training
+    use_gradient_checkpointing = "unsloth", # Reduces memory usage
+    random_state = 3407,
+)
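+
+
+# As a rough illustration of why LoRA is cheap: for one weight matrix of shape `d_out x d_in`, a rank `r` adapter adds two small matrices with `r * (d_in + d_out)` parameters in total. The shapes below are illustrative only and are not Gemma 4's actual dimensions.
+
```python
def lora_extra_params(d_in, d_out, rank):
    # LoRA adds A (rank x d_in) and B (d_out x rank) next to the frozen weight
    return rank * (d_in + d_out)

d_in, d_out, rank = 2048, 2048, 32   # illustrative shapes, not Gemma 4's
base = d_in * d_out
extra = lora_extra_params(d_in, d_out, rank)
print(f"base = {base}, extra = {extra}, ratio = {extra / base:.4f}")
```
+
+# Doubling the rank doubles the extra parameters, which is the memory side of the trade-off behind `lora_rank`.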
+
+
+# # 2048 game
+# 
+# We used GPT-5 to create a variant of the 2048 game. It should output the current game board state, and allow us to advance the game board state with 1 action (up, down, left, right).
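+# 
+# The core rule of 2048 is the row merge: tiles slide to one side, and each pair of equal neighbours merges once per move. Below is a condensed, standalone sketch of that rule under a hypothetical name, `merge_left`; the full implementation in the next cell builds all four directions from the same idea using reversal and transposition.
+
```python
def merge_left(row):
    """Slide tiles left and merge equal neighbours once; return (new_row, gained)."""
    tiles = [x for x in row if x != 0]   # drop empty cells
    merged, gained, i = [], 0, 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # merge the pair into one tile
            gained += tiles[i] * 2       # score gained equals the new tile's value
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    # Pad with zeros back to the original row length
    return merged + [0] * (len(row) - len(merged)), gained

for row in ([2, 2, 4, 0], [2, 2, 2, 2], [4, 0, 4, 4]):
    print(row, "->", merge_left(row))
```
+
+# Note that `[2, 2, 2, 2]` becomes `[4, 4, 0, 0]` and not `[8, 0, 0, 0]`, because a tile created by a merge cannot merge again in the same move.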
+
+# In[ ]:
+
+
+#@title (Collapsible) 2048 Game Implementation
+from dataclasses import dataclass, field
+from typing import List, Tuple, Optional
+import random
+import copy
+
+def _compress_and_merge_row_left(row: List[int]) -> Tuple[List[int], int, bool]:
+    n = len(row)
+    tiles = [x for x in row if x != 0]
+    gained = 0
+    i = 0
+    merged = []
+    while i < len(tiles):
+        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
+            v = tiles[i] * 2
+            gained += v
+            merged.append(v)
+            i += 2
+        else:
+            merged.append(tiles[i])
+            i += 1
+    merged += [0] * (n - len(merged))
+    changed = merged != row
+    return merged, gained, changed
+
+def _move_left(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
+    changed_any = False
+    total_gain = 0
+    new_board = []
+    for row in board:
+        new_row, gained, changed = _compress_and_merge_row_left(row)
+        new_board.append(new_row)
+        total_gain += gained
+        changed_any = changed_any or changed
+    return new_board, total_gain, changed_any
+
+def _move_right(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
+    changed_any = False
+    total_gain = 0
+    new_board = []
+    for row in board:
+        rev = list(reversed(row))
+        new_rev, gained, changed = _compress_and_merge_row_left(rev)
+        new_row = list(reversed(new_rev))
+        new_board.append(new_row)
+        total_gain += gained
+        changed_any = changed_any or changed
+    return new_board, total_gain, changed_any
+
+def _transpose(board: List[List[int]]) -> List[List[int]]:
+    return [list(row) for row in zip(*board)]
+
+def _move_up(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
+    t = _transpose(board)
+    moved, gain, changed = _move_left(t)
+    return _transpose(moved), gain, changed
+
+def _move_down(board: List[List[int]]) -> Tuple[List[List[int]], int, bool]:
+    t = _transpose(board)
+    moved, gain, changed = _move_right(t)
+    return _transpose(moved), gain, changed
+
+def _empty_cells(board: List[List[int]]) -> List[Tuple[int, int]]:
+    size = len(board)
+    return [(r, c) for r in range(size) for c in range(size) if board[r][c] == 0]
+
+def _can_move(board: List[List[int]]) -> bool:
+    if _empty_cells(board):
+        return True
+    size = len(board)
+    for r in range(size):
+        for c in range(size - 1):
+            if board[r][c] == board[r][c + 1]:
+                return True
+    for r in range(size - 1):
+        for c in range(size):
+            if board[r][c] == board[r + 1][c]:
+                return True
+    return False
+
+@dataclass
+class GameBoard:
+    size: int
+    seed: Optional[int] = None
+    target: int = 2048
+    probability_fours: float = 0.10 # originally spawns (4) 10% of the time!
+    _rng: random.Random = field(init = False, repr = False)
+    _board: List[List[int]] = field(init = False, repr = False)
+    _score: int = field(default = 0, init = False, repr = False)
+    _state: str = field(default = "ongoing", init = False, repr = False)
+
+    def __post_init__(self):
+        if self.size < 2:
+            raise ValueError("Board size must be at least 2.")
+        self._rng = random.Random(self.seed)
+        self._board = [[0 for _ in range(self.size)] for _ in range(self.size)]
+        self._add_random_tile()
+        self._add_random_tile()
+        self._update_state_after_change()
+
+    class _BoardView:
+        def __init__(self, game: "GameBoard"):
+            self._game = game
+        def __iter__(self):
+            return iter(self._game._board)
+        def __len__(self):
+            return len(self._game._board)
+        def __getitem__(self, idx):
+            return self._game._board[idx]
+        def __repr__(self) -> str:
+            return repr(self._game._board)
+        __str__ = __repr__
+        def do_action(self, key: str) -> None:
+            self._game.do_action(key)
+        def state(self) -> str:
+            return self._game.state()
+        def pretty(self, colors: bool = True, border: bool = True, dot_for_zero: bool = True) -> str:
+            return self._game._render_pretty(colors = colors, border = border, dot_for_zero = dot_for_zero)
+
+    def board(self) -> "_BoardView":
+        return GameBoard._BoardView(self)
+    def state(self) -> str:
+        return self._state
+    def score(self) -> int:
+        return self._score
+    def do_action(self, key: str) -> None:
+        if self._state != "ongoing":
+            return
+        if not isinstance(key, str) or len(key) == 0:
+            self._state = "failed"
+            return
+        k = key.strip().lower()
+        if k == "q":
+            self._state = "failed"
+            return
+        move_map = {"a": _move_left, "d": _move_right, "w": _move_up, "s": _move_down}
+        if k not in move_map:
+            self._state = "failed"
+            return
+        mover = move_map[k]
+        new_board, gain, changed = mover(self._board)
+        if changed:
+            self._board = new_board
+            self._score += gain
+            self._add_random_tile()
+        self._update_state_after_change()
+    def _add_random_tile(self) -> bool:
+        empties = _empty_cells(self._board)
+        if not empties:
+            return False
+        r, c = self._rng.choice(empties)
+        self._board[r][c] = 4 if self._rng.random() < self.probability_fours else 2
+        return True
+    def _update_state_after_change(self) -> None:
+        if any(self.target in row for row in self._board):
+            self._state = "success"
+            return
+        if not _can_move(self._board):
+            self._state = "failed"
+            return
+        self._state = "ongoing"
+    def _render_pretty(self, colors: bool = True, border: bool = True, dot_for_zero: bool = True) -> str:
+        """
+        Pretty-print the board with colors that scale from 0 up to self.target.
+        Uses ANSI 256-color codes (works in most terminals). Set colors = False to disable.
+        """
+        import math
+
+        b = self._board
+        mx = max((max(row) for row in b), default = 0)
+        cell_w = max(3, len(str(mx)))
+
+        RESET = "\x1b[0m"
+
+        # A smooth-ish gradient from cool → warm
+        # (blue/cyan/green → yellow/orange/red). Tweak or expand as you like.
+        GRAD = [33, 39, 45, 51, 50, 49, 48, 47, 46, 82, 118, 154, 190, 226, 220, 214, 208, 202, 196]
+        ZERO_FG = 239  # dim gray
+
+        def color_code(v: int) -> str:
+            if not colors:
+                return ""
+            if v == 0:
+                return f"\x1b[38;5;{ZERO_FG}m"
+            # Normalize by exponent relative to target: r in [0,1]
+            t = max(2, self.target)  # safety; avoid log2(1)
+            # Guard: if v is not a power of two or is <1, handle gracefully
+            try:
+                r = max(0.0, min(1.0, math.log2(v) / math.log2(t)))
+            except ValueError:
+                r = 0.0
+            idx = int(round(r * (len(GRAD) - 1)))
+            return f"\x1b[38;5;{GRAD[idx]}m"
+
+        def fmt(v: int) -> str:
+            s = "." if (v == 0 and dot_for_zero) else str(v)
+            s = s.rjust(cell_w)
+            return color_code(v) + s + (RESET if colors else "")
+
+        def hline(left: str, mid: str, right: str) -> str:
+            return left + mid.join("─" * cell_w for _ in range(self.size)) + right
+
+        rows = []
+        if border:
+            rows.append(hline("┌", "┬", "┐"))
+        for r in range(self.size):
+            content = "│".join(fmt(v) for v in b[r])
+            rows.append(("│" + content + "│") if border else content)
+            if border:
+                rows.append(hline("└" if r == self.size - 1 else "├",
+                                "┴" if r == self.size - 1 else "┼",
+                                "┘" if r == self.size - 1 else "┤"))
+        return "\n".join(rows)
+
+
+# For example let's create a board of size 5 X 5 and set the target to 8 instead of 2048.
+# 
+# **[NOTE]** 2048 originally spawns a (4) 10% of the time! We can disable this for harder games. See [Wikipedia page](https://en.wikipedia.org/wiki/2048_(video_game)) for more details.
+
+# In[ ]:
+
+
+game = GameBoard(size = 5, seed = 42, target = 8, probability_fours = 0.10)
+print(game.board().pretty(), game.state())
+
+
+# In[ ]:
+
+
+game
+
+
+# We'll use WASD for the action space:
+# 
+# ```
+#    W
+# A  S  D
+# ```
+# Also `game.state()` will say `success` if we succeeded in getting the target!
+
+# In[ ]:
+
+
+game.do_action("A")
+print(game.board().pretty(), game.state())
+
+
+# In[ ]:
+
+
+game.do_action("W")
+print(game.board().pretty(), game.state())
+
+
+# In[ ]:
+
+
+game.do_action("D")
+print(game.board().pretty(), game.state())
+
+
+# In[ ]:
+
+
+game.do_action("W")
+print(game.board().pretty(), game.state())
+
+
+# In[ ]:
+
+
+game.do_action("D")
+print(game.board().pretty(), game.state())
+
+
+# If we pass an action that is not part of the action space, no exception is raised: the game state becomes `failed`, and the game will not accept any more actions.
+
+# In[ ]:
+
+
+game = GameBoard(size = 3, seed = 42, target = 8, probability_fours = 0.10)
+game.do_action("AA") # Not in WASD, so the state becomes "failed"
+game.do_action("W")  # Ignored: the game is already over
+game.do_action("A")  # Ignored: the game is already over
+print(game.board().pretty(), game.state())
+
+
+# # RL Environment Setup
+# 
+# We'll set up a function to accept some strategy that'll emit an action within `WASD` and check the game state.
+# 
+# We'll also add a timer to only execute the strategy for 2 seconds maximum, otherwise it might never terminate!
+
+# In[ ]:
+
+
+from typing import Callable
+from unsloth import execute_with_time_limit
+
+def _execute_strategy(strategy : Callable, game : GameBoard):
+    assert callable(strategy)
+
+    steps = 0
+    while game.state() == "ongoing":
+        action = strategy(list(game.board()))
+        steps += 1
+        if type(action) is not str:
+            return steps, "failed"
+        game.do_action(action)
+    return steps, game.state()
+
+@execute_with_time_limit(2)
+def execute_strategy(strategy : Callable, game : GameBoard):
+    return _execute_strategy(strategy, game)
+
+
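+# Let's make a naive strategy that always plays `W` (up). We expect it to fail: once moving up no longer changes the board, no new tile is spawned, the loop never ends, and the time limit kicks in: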
+
+# In[ ]:
+
+
+def always_move_up(board):
+    return "W" # "W" is up in the WASD action space
+
+game = GameBoard(size = 8, seed = 42, target = 2048, probability_fours = 0.10)
+try:
+    execute_strategy(always_move_up, game)
+except TimeoutError as e:
+    print(f"Timed out with error = {str(e)}")
+
+
+# To give Gemma 4's generated strategies more time during reinforcement learning, we raise the limit to 5 seconds.
+
+# In[ ]:
+
+
+@execute_with_time_limit(5)
+def execute_strategy(strategy : Callable, game : GameBoard):
+    return _execute_strategy(strategy, game)
+
+
+# # Code Execution
+# 
+# Before we build and execute a model-generated Python function, we have to check that it does not reach for global variables or otherwise cheat. This is how we counter reward hacking: we don't want the model to earn reward by cheating instead of writing a real strategy.
+# 
+# For example, the code below is fine, since it only imports from Python's standard library. We check this with `check_python_modules`:
+
+# In[ ]:
+
+
+from unsloth import check_python_modules
+
+sample = """
+def strategy(board):
+    import math
+    from typing import Callable
+    return "W"
+"""
+ok, info = check_python_modules(sample)
+print("Only Python imports?", ok)
+print(info)
+
+
+# For the below piece of code, since we import `numpy`, we should not allow the execution:
+
+# In[ ]:
+
+
+sample = """
+def strategy(board):
+    from numpy import matmul
+    return "W"
+"""
+ok, info = check_python_modules(sample)
+print("Only Python imports?", ok)
+print(info)
+
+
+# We also disallow global variable access. We'll use Unsloth's `create_locked_down_function` for this. Below, `np` is not defined inside the function, so calling it raises an error:
+
+# In[ ]:
+
+
+from unsloth import create_locked_down_function
+function = """
+def import_numpy():
+    np.matmul
+    print("Success")
+"""
+f = create_locked_down_function(function)
+try:
+    f()
+except Exception as e:
+    print(str(e))
+
+
+# In[ ]:
+
+
+from unsloth import create_locked_down_function
+function = """
+def add(a, b):
+    def adder(a):
+        return a + b
+    return adder(b) + b
+"""
+f = create_locked_down_function(function)
+try:
+    print(f(10, 20))
+except Exception as e:
+    print(str(e))
+
+
+# # Data & RL task setup
+# 
+# We now have to create a prompt to tell the model to create a strategy for the 2048 game. You can customize this to some other task for another RL task.
+
+# In[ ]:
+
+
+prompt = """
+Create a new short 2048 strategy using only native Python code.
+You are given a list of list of numbers for the current board state.
+Output one action for "W", "A", "S", "D" on what is the optimal next step.
+Output your new short function in backticks using the format below:
+```python
+def strategy(board):
+    return "W" # Example
+```
+All helper functions should be inside def strategy. Only output the short function `strategy`.
+""".strip()
+print(prompt)
+
+
+# First, let's prompt Gemma 4 without RL and see how it goes:
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+print("=" * 50)
+print("BASE MODEL OUTPUT (before RL training):")
+print("=" * 50)
+
+inputs = tokenizer(
+    text = text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+text_streamer = TextStreamer(tokenizer, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# # Reward functions
+# 
+# We now design an `extract_function` helper which simply extracts the function wrapped in triple backticks.
+# 
+# And 3 reward functions:
+# 
+# 1. `function_works` which rewards the model if the strategy is a valid Python function.
+# 2. `no_cheating` which checks if the function imported other modules, and if it did, we penalize it.
+# 3. `strategy_succeeds` which checks if the game strategy actually succeeds in attaining 2048 after running the auto-generated strategy.
+
+# In[ ]:
+
+
+def extract_function(text):
+    if text.count("```") >= 2:
+        first = text.find("```") + 3
+        second = text.find("```", first)
+        fx = text[first : second].strip()
+        fx = fx.removeprefix("python\n")
+        fx = fx[fx.find("def"):]
+        if fx.startswith("def strategy(board):"): return fx
+    return None
+print(extract_function(prompt))
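+
+
+# As a quick sanity check of the extraction rules, the sketch below repeats the same logic so it runs on its own and feeds it three made-up responses. The sample strings and the `FENCE` constant are illustrative assumptions; `str.removeprefix` needs Python 3.9 or newer.

```python
FENCE = "`" * 3  # built this way so the snippet contains no literal fence

def extract_function(text):
    """Same logic as the notebook cell, repeated so this runs standalone."""
    if text.count(FENCE) >= 2:
        first = text.find(FENCE) + 3
        second = text.find(FENCE, first)
        fx = text[first:second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):]
        if fx.startswith("def strategy(board):"):
            return fx
    return None

# A well-formed answer: prose, then a fenced function with the expected signature.
good = f'Here you go:\n{FENCE}python\ndef strategy(board):\n    return "A"\n{FENCE}\nDone.'
# Fenced, but the function has the wrong name.
wrong_name = f'{FENCE}python\ndef play(board):\n    return "A"\n{FENCE}'
# Correct function, but not wrapped in a fence at all.
no_fence = 'def strategy(board):\n    return "A"'

for sample in (good, wrong_name, no_fence):
    print(repr(extract_function(sample)))
```

+# Only the first sample is accepted. The other two return `None`, which the reward functions treat as a failed completion.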
+
+
+# Below is our `function_works` reward function. It uses Python's `exec`, but guarded so that local and global variables do not leak into the generated function. We can also call `check_python_modules` first to catch errors such as invalid syntax before even building the function:
+
+# In[ ]:
+
+
+ok, info = check_python_modules("def a")
+ok, info
+
+
+# In[ ]:
+
+
+def function_works(completions, **kwargs):
+    scores = []
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if function is not None:
+            ok, info = check_python_modules(function)
+        if function is None or "error" in info:
+            score = -2.0
+        else:
+            try:
+                new_strategy = create_locked_down_function(function)
+                score = 1.0
+            except Exception: # Avoid a bare except, which would also swallow KeyboardInterrupt
+                score = -0.5
+        scores.append(score)
+    return scores
+
+
+# `no_cheating` checks whether the function cheated by importing NumPy or any other non-standard-library module:
+
+# In[ ]:
+
+
+def no_cheating(completions, **kwargs):
+    scores = []
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if function is not None:
+            ok, info = check_python_modules(function)
+            scores.append(1.0 if ok else -20.0) # Penalize heavily!
+        else:
+            scores.append(-1.0) # Failed creating function
+    return scores
+
+
+# Next, `strategy_succeeds` checks whether the strategy actually lets the game terminate. Imagine a strategy that simply returns "W": it would never finish and would hit the 5 second time limit.
+# 
+# We also add a global `PRINTER` to print out the strategy and board state.
+
+# In[ ]:
+
+
+import numpy as np
+PRINTER = 0 # Counts completions so that every 5th strategy gets printed
+def strategy_succeeds(completions, **kwargs):
+    global PRINTER
+    scores = []
+    # Generate a random game board with seed
+    seed = np.random.randint(10000)
+    for completion in completions:
+        printed = False
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+        if PRINTER % 5 == 0:
+            printed = True
+            print(function)
+        PRINTER += 1
+        if function is not None:
+            ok, info = check_python_modules(function)
+        if function is None or "error" in info:
+            scores.append(0)
+            continue
+        try:
+            new_strategy = create_locked_down_function(function)
+        except Exception: # Avoid a bare except, which would also swallow KeyboardInterrupt
+            scores.append(0)
+            continue
+        try:
+            game = GameBoard(size = 6, seed = seed, target = 2048, probability_fours = 0.10)
+            steps, game_state = execute_strategy(new_strategy, game)
+            print(f"Steps = {steps} State = {game_state}")
+            if printed is False:
+                print(function)
+            print(game.board().pretty())
+            if game_state == "success":
+                scores.append(20.0) # Success - massively reward!
+            else:
+                scores.append(2.0) # Failed but function works!
+        except TimeoutError as e:
+            print("Timeout")
+            scores.append(-1.0) # Failed with timeout
+        except Exception as e:
+            print(f"Exception = {str(e)}")
+            scores.append(-3.0) # Failed
+    return scores
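+
+
+# To get a feel for the reward scale, the sketch below adds up the three rewards for a few outcomes. The per-function values are copied from the reward functions above. The equal-weight sum is an assumption about how the trainer combines them (my understanding of TRL's default), and `SCENARIOS` and `total_reward` are illustrative names.

```python
from typing import Dict, Tuple

# (function_works, no_cheating, strategy_succeeds) for each outcome.
SCENARIOS: Dict[str, Tuple[float, float, float]] = {
    "no code block extracted":      (-2.0, -1.0, 0.0),
    "strategy raises an exception": (1.0, 1.0, -3.0),
    "strategy times out":           (1.0, 1.0, -1.0),
    "strategy runs, game lost":     (1.0, 1.0, 2.0),
    "strategy reaches the target":  (1.0, 1.0, 20.0),
}

def total_reward(parts, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the per-function rewards (equal weights assumed)."""
    return sum(w * p for w, p in zip(weights, parts))

for name, parts in SCENARIOS.items():
    print(f"{name:30s} total = {total_reward(parts):+.1f}")
```

+# A winning strategy is worth far more than one that merely runs, which is the signal we want GRPO to pick up.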
+
+
+# We'll now create the dataset, which is simply 1,000 copies of our prompt. We also measure the prompt length in tokens so we know how much room is left for completions.
+
+# In[ ]:
+
+
+from datasets import Dataset
+dataset = Dataset.from_list([{"prompt" : [{"role": "user", "content": prompt.strip()}], "answer" : 0}]*1000)
+maximum_length = len(tokenizer.apply_chat_template([{"role":"user", "content":prompt.strip()}], add_generation_prompt = True, tokenize = True))
+print(maximum_length)
+dataset[0]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# 
+# Now set up the GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth [Reinforcement Learning Docs](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) for more options.
+
+# In[ ]:
+
+
+# Leave room for the prompt (plus 1 token safety margin)
+max_completion_length = max_seq_length - (maximum_length + 1)
+
+from trl import GRPOConfig, GRPOTrainer
+training_args = GRPOConfig(
+    temperature = 1.0,
+    top_p = 0.95,
+    top_k = 64,
+    learning_rate = 5e-5,
+    weight_decay = 0.001,
+    warmup_ratio = 0.1,
+    lr_scheduler_type = "linear",
+    optim = "adamw_8bit",
+    logging_steps = 1,
+    per_device_train_batch_size = 1,
+    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
+    num_generations = 2, # Decrease if out of memory
+    max_completion_length = max_completion_length,
+    # num_train_epochs = 1, # Set to 1 for a full training run
+    max_steps = 60,
+    save_steps = 100,
+    report_to = "none", # Can use Weights & Biases, TrackIO
+    output_dir = "outputs",
+    epsilon = 0.2,
+    epsilon_high = 0.28, # one sided
+    delta = 1.5, # two sided
+    loss_type = 'bnpo',
+    mask_truncated_completions = True,
+    # For optional training + evaluation
+    # fp16_full_eval = True,
+    # per_device_eval_batch_size = 4,
+    # eval_accumulation_steps = 1,
+    # eval_strategy = "steps",
+    # eval_steps = 1,
+)
+
+
+# And let's set up the trainer! Once training starts, you'll see a table of rewards like the one below. The goal is to see the `reward` column increase!
+# 
+# You might have to wait 150 to 200 steps before the reward moves, and you'll probably get 0 reward for the first 100 steps. Since `max_steps` is set to 60 above, raise it if you want to see an improvement. Please be patient!
+# 
+# | Step | Training Loss | reward    | reward_std | completion_length | kl       |
+# |------|---------------|-----------|------------|-------------------|----------|
+# | 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
+# | 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
+# | 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
+
+# In[ ]:
+
+
+# For optional training + evaluation
+# new_dataset = dataset.train_test_split(test_size = 0.01)
+
+trainer = GRPOTrainer(
+    model = model,
+    processing_class = tokenizer,
+    reward_funcs = [
+        function_works,
+        no_cheating,
+        strategy_succeeds,
+    ],
+    args = training_args,
+    train_dataset = dataset,
+
+    # For optional training + evaluation
+    # train_dataset = new_dataset["train"],
+    # eval_dataset = new_dataset["test"],
+)
+
+
+# And let's train the model!
+# 
+# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
+
+# In[ ]:
+
+
+trainer.train()
+
+
+# Now let's save the LoRA adapters we just trained with GRPO:
+
+# In[ ]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+
+
+# Verify LoRA is actually trained!
+
+# In[ ]:
+
+
+from safetensors import safe_open
+
+# Open the adapter saved above in "gemma_4_lora"
+with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
+    # Verify both A and B are non zero
+    for key in f.keys():
+        tensor = f.get_tensor(key)
+        zero_fraction = (tensor == 0).sum().item() / tensor.numel()
+        assert zero_fraction < 1.0, f"{key} is all zeros"
+
+
+# <a name="Inference"></a>
+# # Inference
+# Now let's try the model we just trained!
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+
+_ = model.generate(
+    **tokenizer(images = None, text = text, return_tensors = "pt").to("cuda"),
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    max_new_tokens = 1024,
+    streamer = TextStreamer(tokenizer, skip_prompt = False),
+)
+
+
+# <a name="Save"></a>
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[ ]:
+
+
+# Merge to 16bit
+if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
+
+# Merge to 4bit
+if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
+
+# Just LoRA adapters
+if False:
+    model.save_pretrained("gemma_4_lora")
+    tokenizer.save_pretrained("gemma_4_lora")
+if False:
+    model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+    tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+
+
+# ### GGUF / llama.cpp Conversion
+# We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
+# 
+# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
+# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
+# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
+# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
+# 
+# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+
+# In[ ]:
+
+
+# Save to 8bit Q8_0
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
+# Remember to go to https://huggingface.co/settings/tokens for a token!
+# And change HF_USERNAME to your username!
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
+
+# Save to 16bit GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
+
+# Save to q4_k_m GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
+
+# Save to multiple GGUF options - much faster if you want multiple!
+if False:
+    model.push_to_hub_gguf(
+        "HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
+        tokenizer,
+        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
+# 
+# And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM news, or want to join projects, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,897 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# # Goal: Make Gemma 4 solve Sudoku puzzles with Reinforcement Learning
+# 
+# Our goal is to make Gemma 4 learn to solve Sudoku puzzles using reinforcement learning (GRPO).
+# The model will devise a strategy to fill in empty cells, and we'll reward it for correct placements
+# and completing valid puzzles.
+# 
+# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/12/Sudoku_Puzzle_by_L2G-20050714_solution_standardized_layout.svg/1280px-Sudoku_Puzzle_by_L2G-20050714_solution_standardized_layout.svg.png" height="300" />
+
+# # Installation
+# We'll be using [Unsloth](https://github.com/unslothai/unsloth) to do RL on Gemma 4. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster.
+
+# In[ ]:
+
+
+get_ipython().run_cell_magic('capture', '', 'import os, importlib.util\n!pip install --upgrade -qqq uv\nif importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):\n    try: import numpy, PIL; _numpy = f"numpy=={numpy.__version__}"; _pil = f"pillow=={PIL.__version__}"\n    except: _numpy = "numpy"; _pil = "pillow"\n    # Gemma 4 requires transformers >= 5.5.0 — do NOT pin to 4.x here\n    !uv pip install -qqq \\\n        "torch>=2.8.0" "triton>=3.4.0" {_numpy} {_pil} torchvision bitsandbytes \\\n        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \\\n        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \\\n        git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\nelif importlib.util.find_spec("unsloth") is None:\n    !uv pip install -qqq unsloth\n# Gemma 4 requires transformers >= 5.5.0\n!uv pip install --upgrade --no-deps "transformers>=5.5.0" tokenizers "trl>=0.28.0" unsloth unsloth_zoo\n')
+
+
+# In[ ]:
+
+
+get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+
+
+# ### Unsloth
+
+# In[ ]:
+
+
+from unsloth import FastVisionModel
+import torch
+max_seq_length = 4096 # Can increase for longer reasoning traces
+lora_rank = 32 # Larger rank = smarter, but slower
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastVisionModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E2B-it",
+    max_seq_length = max_seq_length,
+    load_in_4bit = False, # False for LoRA 16bit
+    fast_inference = False, # Set to True to enable vLLM fast inference
+)
+
+
+# To do efficient RL, we will use [LoRA](https://arxiv.org/abs/2106.09685), which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy.
+
+# In[ ]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
+    target_modules = [
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+    ],
+    lora_alpha = lora_rank*2, # *2 speeds up training
+    use_gradient_checkpointing = "unsloth", # Reduces memory usage
+    random_state = 3407,
+)
+
+
+# # Sudoku Game Implementation
+# 
+# We use GPT-5 to create a clean Sudoku solver environment. The strategy outputs "row,col,value" to fill cells.
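+# 
+# As a sketch of what handling that output format could look like, the hypothetical helper below turns a "row,col,value" string into three integers. `parse_move` is not part of the notebook. It assumes 0-indexed rows and columns (0 to 8) and values 1 to 9, matching the ranges that `place_number` checks in the implementation below.

```python
from typing import Optional, Tuple

def parse_move(action: object) -> Optional[Tuple[int, int, int]]:
    """Hypothetical helper: parse "row,col,value", or return None if malformed."""
    if not isinstance(action, str):
        return None
    parts = action.split(",")
    if len(parts) != 3:
        return None
    try:
        row, col, value = (int(p.strip()) for p in parts)
    except ValueError:
        # At least one field was not an integer.
        return None
    # Same bounds as the game: 9x9 board, digits 1 to 9.
    if not (0 <= row < 9 and 0 <= col < 9 and 1 <= value <= 9):
        return None
    return row, col, value

print(parse_move("3,4,9"))
print(parse_move("9,0,1"))
print(parse_move("not a move"))
```

+# Returning `None` for anything malformed lets the caller decide how to penalize a bad action instead of crashing the rollout.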
+
+# In[ ]:
+
+
+#@title Sudoku Game Implementation
+from dataclasses import dataclass, field
+from typing import List, Tuple, Optional
+import random
+import copy
+
+def _is_valid_placement(board: List[List[int]], row: int, col: int, num: int) -> bool:
+    """Check if placing num at (row, col) is valid."""
+    # Check row
+    if num in board[row]:
+        return False
+
+    # Check column
+    if num in [board[r][col] for r in range(9)]:
+        return False
+
+    # Check 3x3 box
+    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
+    for r in range(box_row, box_row + 3):
+        for c in range(box_col, box_col + 3):
+            if board[r][c] == num:
+                return False
+
+    return True
+
+def _solve_sudoku(board: List[List[int]]) -> bool:
+    """Solve sudoku using backtracking (for puzzle generation)."""
+    for row in range(9):
+        for col in range(9):
+            if board[row][col] == 0:
+                for num in range(1, 10):
+                    if _is_valid_placement(board, row, col, num):
+                        board[row][col] = num
+                        if _solve_sudoku(board):
+                            return True
+                        board[row][col] = 0
+                return False
+    return True
+
+def _generate_complete_board(rng: random.Random) -> List[List[int]]:
+    """Generate a complete valid Sudoku board."""
+    board = [[0 for _ in range(9)] for _ in range(9)]
+
+    # Fill diagonal 3x3 boxes first (they don't affect each other)
+    for box in range(3):
+        nums = list(range(1, 10))
+        rng.shuffle(nums)
+        for i in range(3):
+            for j in range(3):
+                board[box * 3 + i][box * 3 + j] = nums[i * 3 + j]
+
+    # Solve the rest
+    _solve_sudoku(board)
+    return board
+
+@dataclass
+class SudokuGame:
+    difficulty: int = 40  # Number of cells to remove (20 = easy, 40 = medium, 50 = hard)
+    seed: Optional[int] = None
+    _rng: random.Random = field(init = False, repr = False)
+    _board: List[List[int]] = field(init = False, repr = False)
+    _solution: List[List[int]] = field(init = False, repr = False)
+    _initial_board: List[List[int]] = field(init = False, repr = False)
+    _moves: int = field(default = 0, init = False, repr = False)
+    _state: str = field(default = "ongoing", init = False, repr = False)
+
+    def __post_init__(self):
+        self._rng = random.Random(self.seed)
+
+        # Generate complete board
+        complete_board = _generate_complete_board(self._rng)
+        self._solution = copy.deepcopy(complete_board)
+
+        # Remove cells to create puzzle
+        self._board = copy.deepcopy(complete_board)
+        cells = [(r, c) for r in range(9) for c in range(9)]
+        self._rng.shuffle(cells)
+
+        for r, c in cells[:self.difficulty]:
+            self._board[r][c] = 0
+
+        self._initial_board = copy.deepcopy(self._board)
+        self._update_state()
+
+    def board(self) -> List[List[int]]:
+        """Return current board state."""
+        return [row[:] for row in self._board]
+
+    def initial_board(self) -> List[List[int]]:
+        """Return initial puzzle state."""
+        return [row[:] for row in self._initial_board]
+
+    def state(self) -> str:
+        """Return game state: 'ongoing', 'success', or 'failed'."""
+        return self._state
+
+    def moves(self) -> int:
+        """Return number of moves made."""
+        return self._moves
+
+    def place_number(self, row: int, col: int, num: int) -> bool:
+        """Place a number on the board. Returns True if valid move."""
+        # Validate input
+        if not (0 <= row < 9 and 0 <= col < 9):
+            self._state = "failed"
+            return False
+
+        if not (1 <= num <= 9):
+            self._state = "failed"
+            return False
+
+        # Can't modify initial cells
+        if self._initial_board[row][col] != 0:
+            self._state = "failed"
+            return False
+        if self._board[row][col] != 0:
+            self._state = "failed"
+            return False
+        # Check if placement is valid
+        if not _is_valid_placement(self._board, row, col, num):
+            self._state = "failed"
+            return False
+
+        # Place number
+        self._board[row][col] = num
+        self._moves += 1
+        self._update_state()
+        return True
+
+    def _update_state(self) -> None:
+        """Update game state based on current board."""
+        # Check if puzzle is complete
+        if all(self._board[r][c] != 0 for r in range(9) for c in range(9)):
+            # Verify solution is correct
+            if self._board == self._solution:
+                self._state = "success"
+            else:
+                self._state = "failed"
+        else:
+            self._state = "ongoing"
+
+    def pretty(self, colors: bool = True) -> str:
+        """Pretty print the Sudoku board."""
+        RESET = "\x1b[0m"
+        INITIAL = "\x1b[38;5;45m"   # Cyan for initial numbers
+        PLACED = "\x1b[38;5;226m"    # Yellow for placed numbers
+        EMPTY = "\x1b[38;5;239m"     # Gray for empty cells
+
+        lines = []
+        lines.append("┌───────┬───────┬───────┐")
+
+        for row in range(9):
+            row_str = "│ "
+            for col in range(9):
+                num = self._board[row][col]
+
+                if colors:
+                    if num == 0:
+                        row_str += f"{EMPTY}.{RESET}"
+                    elif self._initial_board[row][col] != 0:
+                        row_str += f"{INITIAL}{num}{RESET}"
+                    else:
+                        row_str += f"{PLACED}{num}{RESET}"
+                else:
+                    row_str += str(num) if num != 0 else "."
+
+                if col % 3 == 2:
+                    row_str += " │ "
+                else:
+                    row_str += " "
+
+            lines.append(row_str.rstrip())
+
+            if row == 8:
+                lines.append("└───────┴───────┴───────┘")
+            elif row % 3 == 2:
+                lines.append("├───────┼───────┼───────┤")
+
+        return "\n".join(lines)
+
+
+# Test the Sudoku environment:
+
+# In[ ]:
+
+
+# Create an easy puzzle
+game = SudokuGame(difficulty = 30, seed = 42)
+print("Initial puzzle:")
+print(game.pretty())
+print(f"\nState: {game.state()}, Moves: {game.moves()}")
+
+
+# In[ ]:
+
+
+game
+
+
+# Try making some moves:
+
+# In[ ]:
+
+
+# Make a valid move
+game.place_number(0, 1, 7)
+print("\nAfter placing 7 at (row 0, col 1):")
+print(game.pretty())
+print(f"State: {game.state()}, Moves: {game.moves()}")
+
+
+# If a strategy takes an action that's not part of the action space (for example an out-of-range cell or an invalid digit), `place_number` returns `False`, the game state becomes `failed`, and the episode ends.
+
+# # RL Environment Setup
+# 
+# Execute strategies with time limits to prevent infinite loops.
+
+# In[ ]:
+
+
+from typing import Callable
+from unsloth import execute_with_time_limit
+
+def _execute_strategy(strategy: Callable, game: SudokuGame):
+    """Execute a strategy function on a Sudoku game."""
+    assert callable(strategy)
+
+    max_moves = 100
+    valid_moves = 0  # Track successful moves
+
+    while game.state() == "ongoing" and valid_moves < max_moves:
+        try:
+            board = game.board()
+            initial = game.initial_board()
+            result = strategy(board, initial)
+
+            # Validate result format
+            if not isinstance(result, (tuple, list)) or len(result) != 3:
+                # Invalid format = immediate fail, but return valid moves made
+                return valid_moves, "failed"
+
+            row, col, num = result
+
+            # Validate types
+            if not all(isinstance(x, int) for x in [row, col, num]):
+                return valid_moves, "failed"
+
+            # Try to place number
+            success = game.place_number(row, col, num)
+
+            if success:
+                valid_moves += 1  # Count this valid move
+            else:
+                # Invalid move = game fails, but return valid_moves made so far
+                return valid_moves, "failed"
+
+        except Exception:
+            return valid_moves, "failed"
+
+    if valid_moves >= max_moves and game.state() == "ongoing":
+        return valid_moves, "failed"
+
+    return valid_moves, game.state()
+
+
+# To allow longer strategies for Reinforcement Learning, we shall allow a 10 second timer.
+
+# In[ ]:
+
+
+@execute_with_time_limit(10)
+def execute_strategy(strategy: Callable, game: SudokuGame):
+    """Execute strategy with 10 second time limit."""
+    return _execute_strategy(strategy, game)
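+
+
+# As a rough illustration of what a time-limit decorator does, here is a minimal sketch built on a worker thread. This is not Unsloth's `execute_with_time_limit` implementation; the name `with_time_limit` is hypothetical, and a thread-based limit cannot stop a runaway function, it only stops waiting for it.
+

```python
import concurrent.futures
import functools
import time


def with_time_limit(seconds):
    """Hypothetical stand-in for a time-limit decorator (not Unsloth's code)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = executor.submit(fn, *args, **kwargs)
            try:
                # Wait at most `seconds` for the wrapped call to finish.
                return future.result(timeout=seconds)
            except concurrent.futures.TimeoutError:
                raise TimeoutError(f"Timed out after {seconds} seconds") from None
            finally:
                # Do not block on the worker; a running thread cannot be killed.
                executor.shutdown(wait=False)
        return wrapper
    return decorator


@with_time_limit(0.1)
def slow():
    time.sleep(0.5)
    return "done"


@with_time_limit(1.0)
def fast():
    return "done"


print(fast())
try:
    slow()
except TimeoutError as e:
    print(f"Timed out: {e}")
```

+
+# For untrusted, model-generated code, a limit that can actually interrupt the call (signals or a separate process) is the safer design, since the sketch above leaves the worker running in the background.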
+
+
+# Test with a simple strategy:
+
+# In[ ]:
+
+
+def simple_strategy(board, initial):
+    """Simple strategy: fill first empty cell with 7."""
+    for r in range(9):
+        for c in range(9):
+            if board[r][c] == 0 and initial[r][c] == 0:
+                return (r, c, 7)
+    return (0, 0, 7)
+
+game = SudokuGame(difficulty = 30, seed = 42)
+try:
+    moves, state = execute_strategy(simple_strategy, game)
+    print(f"Moves: {moves}, State: {state}")
+except TimeoutError as e:
+    print(f"Timed out: {e}")
+
+
+# In[ ]:
+
+
+print(game.pretty())
+
+
+# # Code Execution
+# 
+# To execute and create a new Python function, we first have to check that the function does not call other global variables or cheat. This is called countering `reward hacking`, since we don't want the function to cheat.
+# 
+# For example the below piece of code is fine, since it only imports Python level functions. We use `check_python_modules`:
+
+# In[ ]:
+
+
+from unsloth import check_python_modules, create_locked_down_function
+
+# Test safe code
+sample = """
+def strategy(board, initial):
+    for r in range(9):
+        for c in range(9):
+            if board[r][c] == 0:
+                return (r, c, 1)
+    return (0, 0, 1)
+"""
+
+ok, info = check_python_modules(sample)
+print("Safe Python code?", ok)
+print(info)
+
+
+# For the below piece of code, since we import `numpy`, we should not allow the execution:
+
+# In[ ]:
+
+
+sample = """
+def strategy(board, initial):
+    import numpy as np
+    return (0, 0, 1)
+"""
+
+ok, info = check_python_modules(sample)
+print("Safe Python code?", ok)
+print(info)
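+
+
+# To make the idea of an import check concrete, here is a minimal sketch that lists the modules a piece of source code imports by walking its syntax tree. It is only an illustration under our own assumptions: `find_imports` is a hypothetical helper, not how `check_python_modules` is implemented, and it does not decide which modules count as allowed.
+

```python
import ast


def find_imports(source):
    """Return the sorted top-level module names imported anywhere in `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b as c` -> record "a"
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # `from a.b import c` -> record "a" (relative imports have no module)
            modules.add(node.module.split(".")[0])
    return sorted(modules)


safe = "def strategy(board, initial):\n    return (0, 0, 1)\n"
unsafe = "def strategy(board, initial):\n    import numpy as np\n    return (0, 0, 1)\n"

print(find_imports(safe))
print(find_imports(unsafe))
```

+
+# Walking the tree with `ast.walk` also finds imports nested inside function bodies, which a check of top-level statements alone would miss.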
+
+
+# # Data & RL task setup
+# 
+# Create the prompt that instructs the model to generate a Sudoku solving strategy. You can customize this to some other task for another RL task.
+
+# In[ ]:
+
+
+prompt = """
+Create a Sudoku solving strategy using only native Python built-in functions without any import statements.
+You are given two lists of lists (9x9 grids):
+- board: current state (0 means empty)
+- initial: starting puzzle (0 means was empty, numbers are fixed)
+
+Return a tuple (row, col, number) for the next move.
+- row: 0-8 (row index)
+- col: 0-8 (column index)
+- number: 1-9 (digit to place)
+
+Only place numbers in cells that are BOTH empty in initial AND empty in board (initial[row][col] == 0 AND board[row][col] == 0)
+Use Sudoku rules: no duplicates in rows, columns, or 3x3 boxes.
+Output your function in backticks:
+```python
+def strategy(board, initial):
+    # Your logic here
+    return (row, col, number)
+```
+All helper functions must be inside def strategy. Output only the function.
+""".strip()
+
+print(prompt)
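+
+
+# For reference, here is a hand-written sketch of a function that satisfies the prompt's contract: no imports, all helpers nested inside `strategy`, and a `(row, col, number)` return value. It is our own illustrative example, not model output. It prefers cells with exactly one legal digit and otherwise falls back to a guess, so it can still fail on harder puzzles. The synthetic grid at the end is made up for the check.
+

```python
def strategy(board, initial):
    # All helpers live inside the function, as the prompt requires.
    def allowed(r, c, n):
        if any(board[r][k] == n for k in range(9)):
            return False
        if any(board[k][c] == n for k in range(9)):
            return False
        br, bc = 3 * (r // 3), 3 * (c // 3)
        return all(board[br + i][bc + j] != n for i in range(3) for j in range(3))

    fallback = None
    for r in range(9):
        for c in range(9):
            if board[r][c] != 0 or initial[r][c] != 0:
                continue
            candidates = [n for n in range(1, 10) if allowed(r, c, n)]
            if len(candidates) == 1:
                return (r, c, candidates[0])  # Forced move: only one legal digit
            if fallback is None and candidates:
                fallback = (r, c, candidates[0])  # A guess that may be wrong
    return fallback if fallback is not None else (0, 0, 1)


# Synthetic solved grid (shifted-row pattern) with one cell blanked out.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
puzzle = [row[:] for row in solved]
puzzle[4][4] = 0

print(strategy(puzzle, puzzle))
```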
+
+
+# First, let's prompt the model without RL and see how it goes:
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+print("=" * 50)
+print("BASE MODEL OUTPUT (before RL training):")
+print("=" * 50)
+
+inputs = tokenizer(
+    text = text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+text_streamer = TextStreamer(tokenizer, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# # Reward functions
+# 
+# We now design an `extract_function` function which simply extracts the function wrapped in three backticks.
+# 
+# And 3 reward functions:
+# 
+# 1. `function_works` which rewards the model if the strategy is a valid Python function.
+# 2. `no_cheating` which checks if the function imported other modules, and if it did, we penalize it.
+# 3. `strategy_succeeds` which checks if the auto-generated strategy actually succeeds in solving the Sudoku puzzle.
+
+# In[ ]:
+
+
+def extract_function(text):
+    """Extract Python function from markdown code blocks."""
+    if text.count("```") >= 2:
+        first = text.find("```") + 3
+        second = text.find("```", first)
+        fx = text[first:second].strip()
+        fx = fx.removeprefix("python\n")
+        fx = fx[fx.find("def"):]
+        if fx.startswith("def strategy(board, initial):"):
+            return fx
+    return None
+
+
+# **Reward 1: Function Works**
+# 
+# Checks if the generated code is valid Python and can be executed.
+
+# In[ ]:
+
+
+def function_works(completions, **kwargs):
+    """Reward for generating valid executable Python code."""
+    scores = []
+    for completion in completions:
+        score = 0
+        response = completion[0]["content"]
+        function = extract_function(response)
+
+        if function is not None:
+            ok, info = check_python_modules(function)
+
+        if function is None or "error" in info:
+            score = -2.0  # Invalid function
+        else:
+            try:
+                new_strategy = create_locked_down_function(function)
+                score = 1.0  # Valid function
+            except Exception:
+                score = -1.0  # Function has errors
+
+        scores.append(score)
+    return scores
+
+
+# **Reward 2: No Cheating**
+# 
+# Penalizes functions that import external libraries.
+
+# In[ ]:
+
+
+def no_cheating(completions, **kwargs):
+    """Penalize use of external imports."""
+    scores = []
+    for completion in completions:
+        response = completion[0]["content"]
+        function = extract_function(response)
+
+        if function is not None:
+            ok, info = check_python_modules(function)
+            scores.append(1.0 if ok else -20.0)  # Heavy penalty for cheating
+        else:
+            scores.append(-1.0)  # Failed to create function
+
+    return scores
+
+
+# **Reward 3: Strategy Succeeds**
+# 
+# Rewards strategies that successfully solve Sudoku puzzles.
+
+# In[ ]:
+
+
+import numpy as np
+
+global PRINTER
+PRINTER = 0
+
+def strategy_succeeds(completions, **kwargs):
+    """Reward valid moves even if strategy eventually fails."""
+    global PRINTER
+    scores = []
+
+    seed = np.random.randint(10000)
+    difficulty = 40
+    for completion in completions:
+        printed = False
+        response = completion[0]["content"]
+        function = extract_function(response)
+
+        if PRINTER % 5 == 0:
+            printed = True
+            print("\n" + "=" * 60)
+            print(function)
+            print("=" * 60)
+        PRINTER += 1
+
+        if function is not None:
+            ok, info = check_python_modules(function)
+
+        if function is None or "error" in info:
+            scores.append(0)
+            continue
+
+        try:
+            new_strategy = create_locked_down_function(function)
+        except Exception:
+            scores.append(0)
+            continue
+
+        try:
+            game = SudokuGame(difficulty = difficulty, seed = seed)
+            valid_moves, game_state = execute_strategy(new_strategy, game)
+            if valid_moves == difficulty:
+                game_state = "success"
+
+            print(f"\n Valid moves: {valid_moves}, Final state: {game_state}")
+
+            if not printed:
+                print("Strategy:")
+                print(function[:200] + "..." if len(function) > 200 else function)
+
+            print("\nFinal board:")
+            print(game.pretty())
+
+            if game_state == "success":
+                scores.append(30.0)  # Solved the puzzle!
+            elif valid_moves > 0:
+                # Reward based on valid moves made before failure
+                # Each valid move is worth 0.2 points
+                reward = valid_moves * 0.2
+                scores.append(reward)
+            else:
+                scores.append(-2.0)  # Failed immediately with no valid moves
+
+        except TimeoutError:
+            print("Timeout")
+            scores.append(-1.0)
+        except Exception as e:
+            print(f"Exception: {str(e)[:100]}")
+            scores.append(-3.0)
+
+    return scores
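+
+
+# To see how the three rewards combine, here is a small worked example that mirrors the scoring branches above. It assumes the trainer adds the reward functions with equal weight; `strategy_score` and `total_reward` are hypothetical helpers written for this illustration only.
+

```python
def strategy_score(valid_moves, game_state):
    """Mirror the scoring branches of `strategy_succeeds` for one finished game."""
    if game_state == "success":
        return 30.0
    if valid_moves > 0:
        return valid_moves * 0.2  # Partial credit per valid move
    return -2.0


def total_reward(function_works, no_cheating, valid_moves, game_state):
    # Assumption: the three reward functions are summed with equal weight.
    return function_works + no_cheating + strategy_score(valid_moves, game_state)


# Valid, import-free function that makes 12 valid moves and then fails.
partial = total_reward(1.0, 1.0, 12, "failed")

# Valid, import-free function that fills all 40 removed cells correctly.
solved = total_reward(1.0, 1.0, 40, "success")

# No extractable function: -2.0 from function_works, -1.0 from no_cheating,
# and 0 from strategy_succeeds, which skips the game entirely.
extraction_failure = -2.0 + -1.0 + 0

print(round(partial, 2), round(solved, 2), extraction_failure)
```

+
+# The gap between partial credit and the 30-point success bonus is what pushes the policy toward strategies that finish the puzzle rather than ones that merely survive a few moves.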
+
+
+# # Dataset Preparation
+# 
+# Create the training dataset.
+
+# In[ ]:
+
+
+from datasets import Dataset
+
+dataset = Dataset.from_list([
+    {
+        "prompt": [{"role": "user", "content": prompt.strip()}],
+        "answer": 0,
+    }
+] * 1000)
+
+maximum_length = len(tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    add_generation_prompt = True
+))
+
+print(f"Maximum prompt length: {maximum_length}")
+print("\nDataset sample:")
+print(dataset[0])
+
+
+# <a name="Train"></a>
+# ### Train the model
+# 
+# Now set up GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth [Reinforcement Learning Docs](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) for more options.
+
+# In[ ]:
+
+
+# Leave room for the prompt (plus 1 token safety margin)
+max_completion_length = max_seq_length - (maximum_length + 1)
+
+from trl import GRPOConfig, GRPOTrainer
+training_args = GRPOConfig(
+    temperature = 1.0,
+    learning_rate = 5e-5,
+    weight_decay = 0.001,
+    warmup_ratio = 0.1,
+    lr_scheduler_type = "linear",
+    optim = "adamw_8bit",
+    logging_steps = 1,
+    per_device_train_batch_size = 1,
+    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
+    num_generations = 2, # Decrease if out of memory
+    max_completion_length = max_completion_length,
+    # num_train_epochs = 1, # Set to 1 for a full training run
+    max_steps = 60,
+    save_steps = 100,
+    report_to = "none", # Can use Weights & Biases, TrackIO
+    output_dir = "outputs",
+    epsilon = 0.2,
+    epsilon_high = 0.28, # one sided
+    delta = 1.5, # two sided
+    loss_type = 'bnpo',
+    mask_truncated_completions = True
+    # For optional training + evaluation
+    # fp16_full_eval = True,
+    # per_device_eval_batch_size = 4,
+    # eval_accumulation_steps = 1,
+    # eval_strategy = "steps",
+    # eval_steps = 1,
+)
+
+
+# And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!
+# 
+# You might have to wait 150 to 200 steps for any action. You'll probably get low reward for the first 100 steps. Please be patient!
+# 
+# | Step | Training Loss | reward    | reward_std | completion_length | kl       |
+# |------|---------------|-----------|------------|-------------------|----------|
+# | 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
+# | 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
+# | 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |
+
+# In[ ]:
+
+
+# For optional training + evaluation
+# new_dataset = dataset.train_test_split(test_size = 0.01)
+
+trainer = GRPOTrainer(
+    model = model,
+    processing_class = tokenizer,
+    reward_funcs = [
+        function_works,
+        no_cheating,
+        strategy_succeeds,
+    ],
+    args = training_args,
+    train_dataset = dataset,
+
+    # For optional training + evaluation
+    # train_dataset = new_dataset["train"],
+    # eval_dataset = new_dataset["test"],
+)
+
+
+# And let's train the model!
+# 
+# **NOTE** A T4 free GPU might take 5 minutes for one generation sadly since it's an old GPU - A100 or H100 will be much faster!
+
+# In[ ]:
+
+
+trainer.train()
+
+
+# And now with the LoRA we just trained with GRPO - we save the LoRA first!
+
+# In[ ]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+
+
+# Verify LoRA is actually trained!
+
+# In[ ]:
+
+
+from safetensors import safe_open
+
+tensors = {}
+with safe_open("gemma_4_lora/adapter_model.safetensors", framework = "pt") as f:
+    # Verify both A and B are non zero
+    for key in f.keys():
+        tensor = f.get_tensor(key)
+        # Fraction of entries that are exactly zero (1.0 means the tensor was never updated)
+        zero_fraction = (tensor == 0).sum() / tensor.numel()
+        assert zero_fraction.item() != 1.0, f"{key} is all zeros"
+
+
+# <a name="Inference"></a>
+# # Inference
+# Now let's try the model we just trained!
+
+# In[ ]:
+
+
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": prompt.strip()}],
+    tokenize = False,
+    add_generation_prompt = True,
+)
+
+from transformers import TextStreamer
+
+_ = model.generate(
+    **tokenizer(images = None,text = text, return_tensors = "pt").to("cuda"),
+    temperature = 1.0,
+    max_new_tokens = 512,
+    streamer = TextStreamer(tokenizer, skip_prompt = False),
+)
+
+
+# <a name="Save"></a>
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[ ]:
+
+
+# Merge to 16bit
+if False: model.save_pretrained_merged("gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_16bit", tokenizer, save_method = "merged_16bit", token = "YOUR_HF_TOKEN")
+
+# Merge to 4bit
+if False: model.save_pretrained_merged("gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit",)
+if False: model.push_to_hub_merged("HF_USERNAME/gemma_4_finetune_4bit", tokenizer, save_method = "merged_4bit", token = "YOUR_HF_TOKEN")
+
+# Just LoRA adapters
+if False:
+    model.save_pretrained("gemma_4_lora")
+    tokenizer.save_pretrained("gemma_4_lora")
+if False:
+    model.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+    tokenizer.push_to_hub("HF_USERNAME/gemma_4_lora", token = "YOUR_HF_TOKEN")
+
+
+# ### GGUF / llama.cpp Conversion
+# To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
+# 
+# Some supported quant methods (full list on our [docs page](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)):
+# * `q8_0` - Fast conversion. High resource use, but generally acceptable.
+# * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
+# * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
+# 
+# [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+
+# In[ ]:
+
+
+# Save to 8bit Q8_0
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer,)
+# Remember to go to https://huggingface.co/settings/tokens for a token!
+# And change HF_USERNAME to your username!
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, token = "YOUR_HF_TOKEN")
+
+# Save to 16bit GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "f16")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "f16", token = "YOUR_HF_TOKEN")
+
+# Save to q4_k_m GGUF
+if False: model.save_pretrained_gguf("gemma_4_finetune", tokenizer, quantization_method = "q4_k_m")
+if False: model.push_to_hub_gguf("HF_USERNAME/gemma_4_finetune", tokenizer, quantization_method = "q4_k_m", token = "YOUR_HF_TOKEN")
+
+# Save to multiple GGUF options - much faster if you want multiple!
+if False:
+    model.push_to_hub_gguf(
+        "HF_USERNAME/gemma_4_finetune", # Change HF_USERNAME to your username!
+        tokenizer,
+        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the `gemma_4_finetune.Q8_0.gguf` file or `gemma_4_finetune.Q4_K_M.gguf` file in llama.cpp.
+# 
+# And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,478 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+from huggingface_hub import snapshot_download
+
+fourbit_models = [
+    # Gemma 4 models
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B-it",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E4B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 8192, # Choose any for long context!
+    load_in_4bit = True,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. Gemma 4's recommended sampling settings are `temperature = 1.0, top_p = 0.95, top_k = 64`, but for this ASR example we use greedy decoding (`do_sample=False`) instead.
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **processor.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        do_sample = False,
+        streamer = TextStreamer(processor, skip_prompt = True),
+    )
+
+
+# <h3>Let's Evaluate Gemma 4 Baseline Performance on German Transcription</h3>
+
+# In[5]:
+
+
+from datasets import load_dataset, Audio, concatenate_datasets
+
+dataset = load_dataset("kadirnar/Emilia-DE-B000000", split = "train")
+
+# Select a single audio sample to reserve for testing.
+# This index is chosen from the full dataset before we create the smaller training split.
+test_audio = dataset[7546]
+
+dataset = dataset.select(range(3000))
+
+dataset = dataset.cast_column("audio", Audio(sampling_rate = 16000))
+
+
+# In[6]:
+
+
+from IPython.display import Audio, display
+print(test_audio['text'])
+Audio(test_audio['audio']['array'],rate = test_audio['audio']['sampling_rate'])
+
+
+# And the translation of the audio from German to English is:
+# 
+# > I—I hold myself directly accountable. That much is, of course, clear: namely, that there are political interests involved in trade—in the exchange of goods—and that political influences are at play. The question is: that should not be the alternative.
+
+# In[7]:
+
+
+messages = [
+    {
+        "role": "system",
+        "content": [
+            {
+                "type": "text",
+                "text": "You are an assistant that transcribes speech accurately.",
+            }
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "audio": test_audio['audio']['array']},
+            {"type": "text", "text": "Please transcribe this audio."}
+        ]
+    }
+]
+
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# <h3>Baseline Model Performance: 32.43% Word Error Rate (WER) for this sample!</h3>
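+
+# For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The helper `word_error_rate` and the example sentences are our own illustration; the notebook does not show how the 32.43% figure was produced, and real evaluations usually normalize casing and punctuation first.
+

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # Reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # Extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Made-up example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2%}")
```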
+
+# # Let's finetune Gemma 4!
+# 
+# You can finetune the vision, text and audio parts.
+
+# We now add LoRA adapters so we only need to update a small number of parameters!
+
+# In[8]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # False if not finetuning vision layers
+    finetune_language_layers   = True,  # False if not finetuning language layers
+    finetune_attention_modules = True,  # False if not finetuning attention layers
+    finetune_mlp_modules       = True,  # False if not finetuning MLP layers
+
+    r = 8,                              # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 16,                    # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,                 # We support rank stabilized LoRA
+    loftq_config = None,                # And LoftQ
+    target_modules = [
+        "q_proj", "k_proj", "v_proj", "o_proj",
+        "gate_proj", "up_proj", "down_proj",
+
+        # Audio layers
+        "post", "linear_start", "linear_end",
+        "embedding_projection",
+        "ffw_layer_1", "ffw_layer_2",
+        "output_proj",
+    ]
+)
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We adapt the `kadirnar/Emilia-DE-B000000` dataset for our German ASR task using Gemma 4 multi-modal chat format. Each audio-text pair is structured into a conversation with `system`, `user`, and `assistant` roles. The processor then converts this into the final training format:
+# 
+# ```
+# <bos><|turn>system
+# You are an assistant that transcribes speech accurately.<turn|>
+# <|turn>user
+# <|audio|>Please transcribe this audio.<turn|>
+# <|turn>model
+# Ich, ich rechne direkt mich an.<turn|>
+# ```
+
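The layout above can be reproduced with a few lines of plain Python. This is only an illustration of the rendered format shown in the example, not the processor's actual chat template: `render_turns` is a hypothetical helper, and the real processor also expands the audio placeholder into audio tokens, which this sketch ignores.

```python
def render_turns(messages: list[dict]) -> str:
    """Render chat messages into the <|turn>...<turn|> layout (illustration only)."""
    role_names = {"assistant": "model"}  # the assistant role is rendered as "model"
    rendered = []
    for message in messages:
        role = role_names.get(message["role"], message["role"])
        body = ""
        for part in message["content"]:
            # Audio parts become a placeholder; text parts are copied verbatim.
            body += "<|audio|>" if part["type"] == "audio" else part["text"]
        rendered.append(f"<|turn>{role}\n{body}<turn|>")
    return "<bos>" + "\n".join(rendered)


example = [
    {"role": "system", "content": [{"type": "text", "text": "You are an assistant that transcribes speech accurately."}]},
    {"role": "user", "content": [{"type": "audio", "audio": None},
                                 {"type": "text", "text": "Please transcribe this audio."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Ich, ich rechne direkt mich an."}]},
]
print(render_turns(example))
```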
+# In[9]:
+
+
+def format_intersection_data(samples: dict) -> dict[str, list]:
+    """Format batched audio/text samples into system, user and assistant chat messages"""
+    formatted_samples = {"messages": []}
+    for idx in range(len(samples["audio"])):
+        audio = samples["audio"][idx]["array"]
+        label = str(samples["text"][idx])
+
+        message = [
+            {
+                "role": "system",
+                "content": [
+                    {
+                        "type": "text",
+                        "text": "You are an assistant that transcribes speech accurately.",
+                    }
+                ],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "audio", "audio": audio},
+                    {"type": "text", "text": "Please transcribe this audio."}
+                ]
+            },
+            {
+                "role": "assistant",
+                "content":[{"type": "text", "text": label}]
+            }
+        ]
+        formatted_samples["messages"].append(message)
+    return formatted_samples
+
+
+# In[10]:
+
+
+dataset = dataset.map(format_intersection_data, batched = True, batch_size = 4, num_proc = 4)
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`.
+
+# In[11]:
+
+
+# Use UnslothVisionDataCollator which handles audio token alignment correctly
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 8,
+        gradient_accumulation_steps = 1,
+        warmup_ratio = 0.03,
+        # num_train_epochs = 1, # Use for full training runs
+        max_steps = 60,
+        learning_rate = 5e-5,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none",
+        remove_unused_columns = False,
+
+        # The below are a must for audio finetuning:
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 8192,
+    )
+)
+
+
+# In[12]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
+
+# In[13]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[14]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64` but for this example we use `do_sample=False` for ASR.
+
+# In[15]:
+
+
+messages = [
+    {
+        "role": "system",
+        "content": [
+            {
+                "type": "text",
+                "text": "You are an assistant that transcribes speech accurately.",
+            }
+        ],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "audio": test_audio['audio']['array']},
+            {"type": "text", "text": "Please transcribe this audio."}
+        ]
+    }
+]
+
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[16]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
+
+# In[17]:
+
+
+if False:
+    from unsloth import FastModel
+    model, processor = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = processor.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(processor, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[18]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", processor)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[19]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", processor,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# We now natively support saving to `GGUF` / `llama.cpp` for all models! For now, you can convert to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4-bit will come later!
+
+# In[20]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        processor,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[21]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        processor,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported `.gguf` file in llama.cpp. With the settings above this is a `Q8_0` file; `Q4_K_M` is not available yet.
+# 
+# And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM news, or are looking for help or projects to join, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,557 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab L4 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+# 
+# `FastModel` supports loading nearly any model now! This includes Vision and Text models!
+
+# In[3]:
+
+
+from unsloth import FastModel
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, tokenizer = FastModel.from_pretrained(
+    model_name = "unsloth/gemma-4-E4B-it",
+    dtype = None, # None for auto detection
+    max_seq_length = 1024, # Choose any for long context!
+    load_in_4bit = True,  # 4 bit quantization to reduce memory
+    full_finetuning = False, # [NEW!] We have full finetuning now!
+    # token = "YOUR_HF_TOKEN", # HF Token for gated models
+)
+
+
+# # Gemma 4 can process Text, Vision and Audio!
+# 
+# Let's first experience how Gemma 4 can handle multimodal inputs. We use Gemma 4's recommended settings of `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[4]:
+
+
+from transformers import TextStreamer
+# Helper function for inference
+def do_gemma_4_inference(messages, max_new_tokens = 128):
+    _ = model.generate(
+        **tokenizer.apply_chat_template(
+            messages,
+            add_generation_prompt = True, # Must add for generation
+            tokenize = True,
+            return_dict = True,
+            return_tensors = "pt",
+        ).to("cuda"),
+        max_new_tokens = max_new_tokens,
+        temperature = 1.0, top_p = 0.95, top_k = 64,
+        streamer = TextStreamer(tokenizer, skip_prompt = True),
+        use_cache = True
+    )
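For readers unfamiliar with the three sampling settings passed to `generate`, here is a simplified sketch of what temperature, top-k and top-p do to a list of logits. It is written from the general definitions of these techniques; `sample_token` is a hypothetical helper, and the exact order and renormalisation inside `transformers` may differ in detail.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=64, top_p=0.95, rng=random):
    """Pick one token id using temperature, then top-k, then top-p filtering."""
    # Temperature scaling followed by a numerically stable (unnormalised) softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]

    # Top-k: keep the ids of the k largest weights, most likely first.
    ranked = sorted(range(len(weights)), key=weights.__getitem__, reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)

    # Top-p: keep the shortest prefix whose renormalised probability mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    # random.choices renormalises the surviving weights before drawing.
    return rng.choices(kept, weights=[weights[i] for i in kept], k=1)[0]


print(sample_token([2.0, 1.0, 0.5, -1.0]))
```

Lower `top_p` or `top_k` values narrow the candidate set; with `top_k = 1` the function always returns the most likely token, which is equivalent to greedy decoding.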
+
+
+# # Gemma 4 can see images!
+# 
+# <img src="https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg" alt="Alt text" height="256">
+
+# In[5]:
+
+
+sloth_link = "https://files.worldwildlife.org/wwfcmsprod/images/Sloth_Sitting_iStock_3_12_2014/story_full_width/8l7pbjmj29_iStock_000011145477Large_mini__1_.jpg"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "Which films does this animal feature in?" }
+    ]
+}]
+# You might have to wait 1 minute for Unsloth's auto compiler
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# Let's make a poem about sloths!
+
+# In[6]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{ "type" : "text",
+                  "text" : "Write a poem about sloths." }]
+}]
+do_gemma_4_inference(messages)
+
+
+# # Gemma 4 can also hear!
+
+# In[7]:
+
+
+from IPython.display import Audio, display
+Audio("https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3")
+
+
+# In[8]:
+
+
+get_ipython().system('wget -qqq https://www.nasa.gov/wp-content/uploads/2015/01/591240main_JFKmoonspeech.mp3 -O audio.mp3')
+
+
+# In[9]:
+
+
+audio_file = "audio.mp3"
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "audio", "audio" : audio_file },
+        { "type": "text",  "text" : "What is this audio about?" }
+    ]
+}]
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# # Let's combine all 3 modalities together!
+
+# In[10]:
+
+
+messages = [{
+    "role" : "user",
+    "content": [
+        { "type": "audio", "audio" : audio_file },
+        { "type": "image", "image" : sloth_link },
+        { "type": "text",  "text" : "What is this audio and image about? "\
+                                    "How are they related?" }
+    ]
+}]
+do_gemma_4_inference(messages, max_new_tokens = 256)
+
+
+# # Let's finetune Gemma 4!
+# 
+# For now you can select whether to finetune the vision and text parts. The audio part can also be finetuned, and we're working to make it selectable as well!
+
+# We now add LoRA adapters so we only need to update a small number of parameters!
+
+# In[11]:
+
+
+model = FastModel.get_peft_model(
+    model,
+    finetune_vision_layers     = False, # Turn off for just text!
+    finetune_language_layers   = True,  # Should leave on!
+    finetune_attention_modules = True,  # Attention good for GRPO
+    finetune_mlp_modules       = True,  # Should leave on always!
+
+    r = 8,           # Larger = higher accuracy, but might overfit
+    lora_alpha = 8,  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+)
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We now use the `Gemma-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-4 renders multi turn conversations like below:
+# 
+# ```
+# <bos><|turn>user
+# Hello<turn|>
+# <|turn>model
+# Hey there!<turn|>
+# ```
+# We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3, gemma-4` and more.
+
+# In[12]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4",
+)
+
+
+# We get the first 3000 rows of the dataset
+
+# In[13]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("mlabonne/FineTome-100k", split = "train[:3000]")
+
+
+# We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!
+
+# In[14]:
+
+
+from unsloth.chat_templates import standardize_data_formats
+dataset = standardize_data_formats(dataset)
+
+
+# Let's see what row 100 looks like!
+
+# In[15]:
+
+
+dataset[100]
+
+
+# We now have to apply the chat template for `Gemma-4` onto the conversations, and save it to `text`. We remove the `<bos>` token using `removeprefix('<bos>')` since we're finetuning. The Processor will add this token before training and the model expects only one.
+
+# In[16]:
+
+
+def formatting_prompts_func(examples):
+    convos = examples["conversations"]
+    # Strip the leading <bos>; the tokenizer adds one again during training.
+    texts = [
+        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>')
+        for convo in convos
+    ]
+    return { "text" : texts, }
+
+dataset = dataset.map(formatting_prompts_func, batched = True)
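The reason for stripping `<bos>` can be shown with plain strings. In this sketch `add_bos` is a hypothetical stand-in for the tokenizer step that prepends a beginning-of-sequence token at training time; it is not an Unsloth or `transformers` function.

```python
BOS = "<bos>"


def add_bos(text: str) -> str:
    """Stand-in for the tokenizer, which prepends one BOS token during training."""
    return BOS + text


templated = "<bos><|turn>user\nHello<turn|>\n<|turn>model\nHey there!<turn|>"

doubled = add_bos(templated)                    # template output used as-is
single = add_bos(templated.removeprefix(BOS))   # template output with <bos> stripped first

print(doubled.count(BOS), single.count(BOS))
```

`str.removeprefix` only removes the token when it sits at the very start of the string, so text that happens to contain `<bos>` elsewhere is left untouched.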
+
+
+# Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.
+
+# In[17]:
+
+
+dataset[100]["text"]
+
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`.
+
+# In[18]:
+
+
+from trl import SFTTrainer, SFTConfig
+trainer = SFTTrainer(
+    model = model,
+    tokenizer = tokenizer,
+    train_dataset = dataset,
+    eval_dataset = None, # Can set up evaluation!
+    args = SFTConfig(
+        dataset_text_field = "text",
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
+        warmup_steps = 5,
+        # num_train_epochs = 1, # Set this for 1 full training run.
+        max_steps = 60,
+        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
+        logging_steps = 1,
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "linear",
+        seed = 3407,
+        report_to = "none", # Use TrackIO/WandB etc
+    ),
+)
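It helps to work out how much data this configuration actually touches. The arithmetic below uses the values from the `SFTConfig` above and the 3000-row dataset slice; the single-GPU setting is an assumption (typical for a Colab instance), and `training_coverage` is a hypothetical helper.

```python
def training_coverage(per_device_batch: int, grad_accum: int, max_steps: int,
                      dataset_rows: int, num_devices: int = 1):
    """Return (effective batch size, examples seen, fraction of the dataset seen)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    examples_seen = effective_batch * max_steps
    return effective_batch, examples_seen, examples_seen / dataset_rows


# Values from the configuration above; num_devices = 1 is an assumption.
batch, seen, fraction = training_coverage(
    per_device_batch=1, grad_accum=4, max_steps=60, dataset_rows=3000
)
print(f"effective batch = {batch}, examples seen = {seen}, dataset coverage = {fraction:.0%}")
```

So the 60-step demo run sees 240 of the 3000 rows, which is why a full run switches to `num_train_epochs`.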
+
+
+# We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
+
+# In[19]:
+
+
+from unsloth.chat_templates import train_on_responses_only
+trainer = train_on_responses_only(
+    trainer,
+    instruction_part = "<|turn>user\n",
+    response_part = "<|turn>model\n",
+)
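Conceptually, response-only training replaces the labels of every non-response token with -100, the value the loss ignores. The sketch below shows the idea on a list of string tokens. It is a simplification and not Unsloth's implementation: `mask_non_responses` is a hypothetical helper, and each marker is treated as a single token, whereas the real markers span several token ids.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_non_responses(tokens, instruction_part, response_part):
    """Keep labels only for tokens between a response marker and the next instruction marker."""
    labels, in_response = [], False
    for token in tokens:
        if token == response_part:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a training target
        elif token == instruction_part:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token if in_response else IGNORE_INDEX)
    return labels


tokens = ["<|turn>user\n", "Hello", "<turn|>", "<|turn>model\n", "Hey", "there!", "<turn|>"]
print(mask_non_responses(tokens, "<|turn>user\n", "<|turn>model\n"))
```

Only the model's reply and its closing `<turn|>` keep their labels, so the loss is computed on the answer alone.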
+
+
+# Let's verify that the instruction part is masked! Let's print row 100 again. Notice how the sample only has a single `<bos>` as expected!
+
+# In[20]:
+
+
+tokenizer.decode(trainer.train_dataset[100]["input_ids"])
+
+
+# Now let's print the masked out example - you should see only the answer is present:
+
+# In[21]:
+
+
+tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
+
+
+# In[22]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# # Let's train the model!
+# 
+# To resume a training run, set `trainer.train(resume_from_checkpoint = True)`
+
+# In[23]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[24]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model via Unsloth native inference! According to the `Gemma-4` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`
+
+# In[25]:
+
+
+from unsloth.chat_templates import get_chat_template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template = "gemma-4",
+)
+messages = [{
+    "role": "user",
+    "content": [{
+        "type" : "text",
+        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
+    }]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+outputs = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+)
+tokenizer.batch_decode(outputs)
+
+
+# You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!
+
+# In[26]:
+
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 64, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[27]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+tokenizer.save_pretrained("gemma_4_lora")
+# model.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# tokenizer.push_to_hub("HF_ACCOUNT/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:
+
+# In[28]:
+
+
+if False:
+    from unsloth import FastModel
+    model, tokenizer = FastModel.from_pretrained(
+        model_name = "gemma_4_lora", # YOUR MODEL YOU USED FOR TRAINING
+        max_seq_length = 2048,
+        load_in_4bit = True,
+    )
+
+messages = [{
+    "role": "user",
+    "content": [{"type" : "text", "text" : "What is Gemma-4?",}]
+}]
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt = True, # Must add for generation
+    return_tensors = "pt",
+    tokenize = True,
+    return_dict = True,
+).to("cuda")
+
+from transformers import TextStreamer
+_ = model.generate(
+    **inputs,
+    max_new_tokens = 128, # Increase for longer outputs!
+    # Recommended Gemma-4 settings!
+    temperature = 1.0, top_p = 0.95, top_k = 64,
+    streamer = TextStreamer(tokenizer, skip_prompt = True),
+)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly for deployment! We save it in the folder `gemma-4-finetune`. Set `if False` to `if True` to let it run!
+
+# In[29]:
+
+
+if False: # Change to True to save finetune!
+    model.save_pretrained_merged("gemma-4-finetune", tokenizer)
+
+
+# If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[30]:
+
+
+if False: # Change to True to upload finetune
+    model.push_to_hub_merged(
+        "HF_ACCOUNT/gemma-4-finetune", tokenizer,
+        token = "YOUR_HF_TOKEN"
+    )
+
+
+# ### GGUF / llama.cpp Conversion
+# We now natively support saving to `GGUF` / `llama.cpp` for all models! For now, you can convert to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4-bit will come later!
+
+# In[31]:
+
+
+if False: # Change to True to save to GGUF
+    model.save_pretrained_gguf(
+        "gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # For now only Q8_0, BF16, F16 supported
+    )
+
+
+# Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!
+
+# In[32]:
+
+
+if False: # Change to True to upload GGUF
+    model.push_to_hub_gguf(
+        "HF_ACCOUNT/gemma_4_finetune",
+        tokenizer,
+        quantization_method = "Q8_0", # Only Q8_0, BF16, F16 supported
+        token = "YOUR_HF_TOKEN",
+    )
+
+
+# Now, use the exported `.gguf` file in llama.cpp. With the settings above this is a `Q8_0` file; `Q4_K_M` is not available yet.
+# 
+# And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM news, or are looking for help or projects to join, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+# <div class="align-center">
+#   <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+#   <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
+#   <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
+# 
+#   Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
+# </div>
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
@@ -0,0 +1,448 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+# To run this, press "*Runtime*" and press "*Run all*" on a Google Colab L4 instance!
+# <div class="align-center">
+# <a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
+# <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
+# <a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
+# </div>
+# 
+# To install Unsloth on your local device, follow [our guide](https://unsloth.ai/docs/get-started/install). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
+# 
+# You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & how to save it
+
+# ### News
+
+# Introducing **Unsloth Studio** - a new open source, no-code web UI to train and run LLMs. [Blog](https://unsloth.ai/docs/new/studio) • [Notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb)
+# 
+# <table><tr>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FxV1PO5DbF3ksB51nE2Tw%252Fmore%2520cropped%2520ui%2520for%2520homepage.png%3Falt%3Dmedia%26token%3Df75942c9-3d8d-4b59-8ba2-1a4a38de1b86&width=376&dpr=3&quality=100&sign=a663c397&sv=2" width="200" height="120" alt="Unsloth Studio Training UI"></a><br><sub><b>Train models</b> — no code needed</sub></td>
+# <td align="center"><a href="https://unsloth.ai/docs/new/studio"><img src="https://unsloth.ai/docs/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F~%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252FRCnTAZ6Uh88DIlU3g0Ij%252Fmainpage%2520unsloth.png%3Falt%3Dmedia%26token%3D837c96b6-bd09-4e81-bc76-fa50421e9bfb&width=376&dpr=3&quality=100&sign=c1a39da1&sv=2" width="200" height="120" alt="Unsloth Studio Chat UI"></a><br><sub><b>Run GGUF models</b> on Mac, Windows & Linux</sub></td>
+# </tr></table>
+# 
+# Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. [Blog](https://unsloth.ai/docs/new/faster-moe)
+# 
+# Ultra Long-Context Reinforcement Learning is here with 7x more context windows! [Blog](https://unsloth.ai/docs/new/grpo-long-context)
+# 
+# New in Reinforcement Learning: [FP8 RL](https://unsloth.ai/docs/new/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/new/vision-reinforcement-learning-vlm-rl) • [Standby](https://unsloth.ai/docs/basics/memory-efficient-rl) • [gpt-oss RL](https://unsloth.ai/docs/new/gpt-oss-reinforcement-learning)
+# 
+# Visit our docs for all our [model uploads](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).
+
+# # ### Installation
+# 
+# # In[1]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n    !pip install unsloth  # Do this in local & cloud setups\nelse:\n    import torch; v = re.match(r\'[\\d]{1,}\\.[\\d]{1,}\', str(torch.__version__)).group(0)\n    xformers = \'xformers==\' + {\'2.10\':\'0.0.34\',\'2.9\':\'0.0.33.post1\',\'2.8\':\'0.0.32.post2\'}.get(v, "0.0.34")\n    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth\n!pip install --no-deps transformers==5.5.0\n!pip install torchcodec\nimport torch; torch._dynamo.config.recompile_limit = 64;\n')
+# 
+# 
+# # In[2]:
+# 
+# 
+# get_ipython().run_cell_magic('capture', '', '!pip install --no-deps --upgrade timm # For Gemma 4 vision/audio\n')
+# 
+# 
+# # ### Unsloth
+
+# In[3]:
+
+
+from unsloth import FastVisionModel # FastLanguageModel for LLMs
+import torch
+
+gemma4_models = [
+    # Gemma-4 instruct models:
+    "unsloth/gemma-4-E2B-it",
+    "unsloth/gemma-4-E4B-it",
+    "unsloth/gemma-4-31B-it",
+    "unsloth/gemma-4-26B-A4B-it",
+    # Gemma-4 base models:
+    "unsloth/gemma-4-E2B",
+    "unsloth/gemma-4-E4B",
+    "unsloth/gemma-4-31B",
+    "unsloth/gemma-4-26B-A4B",
+] # More models at https://huggingface.co/unsloth
+
+model, processor = FastVisionModel.from_pretrained(
+    "unsloth/gemma-4-E4B-it",
+    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
+    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
+)
+
+
+# We now add LoRA adapters for parameter-efficient fine-tuning, so that only about 1% of the model's parameters are trained.
+# 
+# **[NEW]** We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!
+
+# In[4]:
+
+
+model = FastVisionModel.get_peft_model(
+    model,
+    finetune_vision_layers     = True, # False if not finetuning vision layers
+    finetune_language_layers   = True, # False if not finetuning language layers
+    finetune_attention_modules = True, # False if not finetuning attention layers
+    finetune_mlp_modules       = True, # False if not finetuning MLP layers
+
+    r = 32,                           # The larger, the higher the accuracy, but might overfit
+    lora_alpha = 32,                  # Recommended alpha == r at least
+    lora_dropout = 0,
+    bias = "none",
+    random_state = 3407,
+    use_rslora = False,               # We support rank stabilized LoRA
+    loftq_config = None,              # And LoftQ
+    target_modules = "all-linear",    # Optional now! Can specify a list if needed
+)
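+
+# To see why LoRA trains only a small share of the weights (roughly 1%), here is a minimal sketch of the parameter count for a single linear layer. The layer sizes are hypothetical and are not Gemma 4's real dimensions; only `r = 32` and `lora_alpha = 32` come from the cell above.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scale applied to the low-rank update: alpha / r, or alpha / sqrt(r) for rsLoRA."""
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r


# Hypothetical square projection layer, chosen only for illustration.
d_in, d_out, r = 4096, 4096, 32
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full layer: {full:,} params, LoRA adds {added:,} ({added / full:.2%})")
print("scaling:", lora_scaling(32, r))
```

+# With `lora_alpha == r` the update is applied at scale 1.0, which is why the two values are usually changed together.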
+
+
+# <a name="Data"></a>
+# ### Data Prep
+# We'll use a sampled dataset of handwritten math formulas. The objective is to convert these images into a computer-readable format—specifically LaTeX—so they can be rendered. This is particularly useful for complex expressions.
+# 
+# You can access the dataset [here](https://huggingface.co/datasets/unsloth/LaTeX_OCR). The full dataset is [here](https://huggingface.co/datasets/linxy/LaTeX_OCR).
+
+# In[5]:
+
+
+from datasets import load_dataset
+dataset = load_dataset("unsloth/LaTeX_OCR", split = "train")
+
+
+# Let's take a look at the dataset. We'll examine the third image (index 2) and its corresponding caption.
+
+# In[6]:
+
+
+dataset
+
+
+# In[7]:
+
+
+dataset[2]["image"]
+
+
+# In[8]:
+
+
+dataset[2]["text"]
+
+
+# We can also render LaTeX directly in the browser!
+
+# In[9]:
+
+
+from IPython.display import display, Math, Latex
+
+latex = dataset[2]["text"]  # same sample as the image and caption shown above
+display(Math(latex))
+
+
+# To format the dataset, all vision fine-tuning tasks should follow this format: a user turn holding the instruction and the image, then an assistant turn holding the target text.
+# 
+# ```python
+# [
+#     {
+#         "role": "user",
+#         "content": [
+#             {"type": "text", "text": instruction},
+#             {"type": "image", "image": sample["image"]},
+#         ],
+#     },
+#     {
+#         "role": "assistant",
+#         "content": [
+#             {"type": "text", "text": sample["text"]},
+#         ],
+#     },
+# ]
+# ```
+
+# In[10]:
+
+
+instruction = "Write the LaTeX representation for this image."
+
+def convert_to_conversation(sample):
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": instruction},
+                {"type": "image", "image": sample["image"]},
+            ],
+        },
+        {"role": "assistant", "content": [{"type": "text", "text": sample["text"]}]},
+    ]
+    return {"messages": conversation}
+pass
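+
+# A quick structural check can catch formatting mistakes before training starts. The sketch below is a hypothetical helper (`check_conversation` is not part of Unsloth) and assumes the two-turn layout produced by `convert_to_conversation`: one user turn with a text item and an image item, then one assistant turn with text only.

```python
def check_conversation(example: dict) -> None:
    """Raise ValueError if `example` does not match the user/assistant layout."""
    messages = example.get("messages")
    if not isinstance(messages, list) or len(messages) != 2:
        raise ValueError("expected exactly two messages")
    roles = [m.get("role") for m in messages]
    if roles != ["user", "assistant"]:
        raise ValueError(f"expected roles ['user', 'assistant'], got {roles}")
    user_types = sorted(item["type"] for item in messages[0]["content"])
    if user_types != ["image", "text"]:
        raise ValueError(f"user turn needs one text and one image item, got {user_types}")
    assistant_types = [item["type"] for item in messages[1]["content"]]
    if assistant_types != ["text"]:
        raise ValueError(f"assistant turn should be text only, got {assistant_types}")


# A stand-in example; the string replaces the PIL image a real sample would hold.
example = {
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Write the LaTeX representation for this image."},
            {"type": "image", "image": "<PIL image placeholder>"},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": r"\frac{a}{b}"}]},
    ]
}
check_conversation(example)  # passes silently when the structure is right
```

+# Running it over a handful of converted samples is usually enough to spot a wrong role or a missing image item.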
+
+
+# Let's convert the dataset into the "correct" format for finetuning:
+
+# In[11]:
+
+
+converted_dataset = [convert_to_conversation(sample) for sample in dataset]
+
+
+# The first example is now structured like below:
+
+# In[12]:
+
+
+converted_dataset[0]
+
+
+# Let's apply the Gemma 4 instruction chat template to our processor:
+
+# In[13]:
+
+
+from unsloth import get_chat_template
+
+processor = get_chat_template(
+    processor,
+    "gemma-4"
+)
+
+
+# Before fine-tuning, let us evaluate the base model's performance. We do not expect strong results, as it has not encountered this chat template before.
+
+# In[14]:
+
+
+image = dataset[2]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# As you can see, the output is poor: the model does not follow the instruction.
+
+# <a name="Train"></a>
+# ### Train the model
+# Now let's train our model. We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`. We also support `DPOTrainer` and `GRPOTrainer` for reinforcement learning!
+# 
+# We use our new `UnslothVisionDataCollator` which will help in our vision finetuning setup.
+
+# In[15]:
+
+
+from unsloth.trainer import UnslothVisionDataCollator
+from trl import SFTTrainer, SFTConfig
+
+trainer = SFTTrainer(
+    model = model,
+    train_dataset = converted_dataset,
+    processing_class = processor.tokenizer,
+    data_collator = UnslothVisionDataCollator(model, processor),
+    args = SFTConfig(
+        per_device_train_batch_size = 1,
+        gradient_accumulation_steps = 4,
+        max_grad_norm = 0.3,
+        warmup_ratio = 0.03,
+        max_steps = 60,
+        # num_train_epochs = 1, # Set this instead of max_steps for full training runs
+        learning_rate = 2e-4,
+        logging_steps = 1,
+        save_strategy = "steps",
+        optim = "adamw_8bit",
+        weight_decay = 0.001,
+        lr_scheduler_type = "cosine",
+        seed = 3407,
+        output_dir = "outputs",
+        report_to = "none", # For Weights and Biases or others
+
+        # You MUST put the below items for vision finetuning:
+        remove_unused_columns = False,
+        dataset_text_field = "",
+        dataset_kwargs = {"skip_prepare_dataset": True},
+        max_length = 2048,
+    )
+)
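+
+# For a sense of scale, the sketch below works out what the settings above imply: the effective batch size, how many samples 60 steps cover, and a linear-warmup plus cosine-decay learning-rate curve. The schedule formula is an assumption modelled on the usual warmup-then-cosine shape, not a copy of the trainer's internals, and `num_devices = 1` assumes a single-GPU runtime.

```python
import math

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 60
warmup_ratio = 0.03
learning_rate = 2e-4
num_devices = 1  # assumption: single-GPU Colab runtime

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch * max_steps
warmup_steps = math.ceil(max_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup to the peak rate, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch = {effective_batch}, samples seen = {samples_seen}")
print(f"warmup steps = {warmup_steps}")
for step in (0, 1, 2, 30, 60):
    print(f"step {step:>2}: lr = {lr_at(step):.2e}")
```

+# The short run therefore touches only a few hundred samples, which is why it works as a smoke test and not as a full fine-tune.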
+
+
+# In[16]:
+
+
+# @title Show current memory stats
+gpu_stats = torch.cuda.get_device_properties(0)
+start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+print(f"{start_gpu_memory} GB of memory reserved.")
+
+
+# In[17]:
+
+
+trainer_stats = trainer.train()
+
+
+# In[18]:
+
+
+# @title Show final memory and time stats
+used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+used_percentage = round(used_memory / max_memory * 100, 3)
+lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+print(
+    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
+)
+print(f"Peak reserved memory = {used_memory} GB.")
+print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+
+# <a name="Inference"></a>
+# ### Inference
+# Let's run the model! You can modify the instruction and input—just leave the output blank.
+# 
+# We'll use the sampling settings recommended for Gemma: `top_p=0.95`, `top_k=64`, and `temperature=1.0`.
+
+# In[19]:
+
+
+image = dataset[10]["image"]
+instruction = "Write the LaTeX representation for this image."
+
+messages = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
+    }
+]
+
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor, skip_prompt = True)
+result = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                        use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
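+
+# The three sampling settings each narrow the next-token distribution in a different way. One common order is: temperature rescales the logits, `top_k` keeps the k most likely tokens, and `top_p` keeps the smallest prefix of those whose cumulative probability reaches p. The sketch below illustrates that idea on a toy five-token vocabulary with made-up logits; it is a simplified illustration, not the code path that `model.generate` runs.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep the k most probable tokens, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass_k = sum(probs[i] for i in ranked)

    # top-p: walk down the ranking until the cumulative (renormalised) mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [4.0, 3.0, 1.0, 0.5, -2.0]  # made-up values for a 5-token vocabulary
print(filter_distribution(toy_logits))
print(filter_distribution(toy_logits, top_k=2))
```

+# Lowering `top_p` or `top_k` makes the output more deterministic, which can help for strict formats such as LaTeX.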
+
+
+# <a name="Save"></a>
+# ### Saving, loading finetuned models
+# To save the final model as LoRA adapters, use Hugging Face’s `push_to_hub` for online saving, or `save_pretrained` for local storage.
+# 
+# **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
+
+# In[20]:
+
+
+model.save_pretrained("gemma_4_lora")  # Local saving
+processor.save_pretrained("gemma_4_lora")
+# model.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+# processor.push_to_hub("your_name/gemma_4_lora", token = "YOUR_HF_TOKEN") # Online saving
+
+
+# To load the LoRA adapters we just saved and run inference with them, change `False` to `True`:
+
+# In[21]:
+
+
+if False:
+    from unsloth import FastVisionModel
+
+    model, processor = FastVisionModel.from_pretrained(
+        model_name = "gemma_4_lora",  # YOUR MODEL YOU USED FOR TRAINING
+        load_in_4bit = True,  # Set to False for 16bit LoRA
+    )
+
+sample = dataset[1]
+image = sample["image"].convert("RGB")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "text",
+                "text": instruction,  # the prompt, not the ground-truth LaTeX in sample["text"]
+            },
+            {
+                "type": "image",
+            },
+        ],
+    },
+]
+input_text = processor.apply_chat_template(messages, add_generation_prompt = True)
+inputs = processor(
+    image,
+    input_text,
+    add_special_tokens = False,
+    return_tensors = "pt",
+).to("cuda")
+
+from transformers import TextStreamer
+
+text_streamer = TextStreamer(processor.tokenizer, skip_prompt = True)
+_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
+                   use_cache = True, temperature = 1.0, top_p = 0.95, top_k = 64)
+
+
+# ### Saving to float16 for vLLM
+# 
+# We also support saving to `float16` directly. Select `merged_16bit` for float16. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See [our docs](https://unsloth.ai/docs/basics/inference-and-deployment) for more deployment options.
+
+# In[22]:
+
+
+# Select ONLY 1 to save! (Both not needed!)
+
+# Save locally to 16bit
+if False: model.save_pretrained_merged("unsloth_finetune", processor)
+
+# To export and save to your Hugging Face account
+if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", processor, token = "YOUR_HF_TOKEN")
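+
+# Conceptually, a merged export folds the low-rank update into each adapted weight matrix, W' = W + (alpha / r) * B @ A, so the saved checkpoint needs no adapter at load time. The sketch below shows that arithmetic on tiny hand-written matrices. All values are made up, and it illustrates standard LoRA merging, not the internals of `save_pretrained_merged`.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (alpha / r) * (B @ A) without modifying the inputs."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Made-up 2x3 base weight with a rank-1 adapter (A is 1x3, B is 2x1).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5],
     [-1.0]]

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

+# After merging, the result has the same shape as the base weight, so inference engines can load it like any ordinary checkpoint.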
+
+
+# And we're done! If you have questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!
+# 
+# Some other resources:
+# 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
+# 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
+# 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
+# 4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://unsloth.ai/docs/get-started/unsloth-notebooks)!
+# 
+#   This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).