
IBM Granite 4.0

How to run IBM Granite-4.0 with Unsloth GGUFs on llama.cpp, Ollama and how to fine-tune!

IBM releases Granite-4.0 models in three sizes: Micro (3B), Tiny (7B total, 1B active) and Small (32B total, 9B active). The models are trained on 15T tokens, and IBM's new hybrid (H) Mamba architecture lets Granite-4.0 run faster with lower memory use.

Learn how to run Unsloth Granite-4.0 Dynamic GGUFs or how to fine-tune/RL the model. You can fine-tune Granite-4.0 for a support-agent use case with our free Colab notebook.


Unsloth Granite-4.0 uploads:

  • Dynamic GGUFs

  • Unsloth Dynamic 4-bit Instruct

  • 16-bit Instruct

  • FP8 Dynamic

You can also view our Granite-4.0 collection for all uploads, including Dynamic FP8 quants and more.

Granite-4.0 Model Explanations:

  • H-Small (MoE): Enterprise workhorse for daily tasks, supports multiple long-context sessions on entry GPUs like L40S (32B total, 9B active).

  • H-Tiny (MoE): Fast, cost-efficient for high-volume, low-complexity tasks; optimized for local and edge use (7B total, 1B active).

  • H-Micro (Dense): Lightweight, efficient for high-volume, low-complexity workloads; ideal for local and edge deployment (3B total).

  • Micro (Dense): Alternative dense option when Mamba2 isn’t fully supported (3B total).

Run Granite-4.0 Tutorials

IBM only recommends a few settings (such as context length), so we'll use standard sampling settings:

temperature=1.0, top_p=1.0, top_k=0

  • Temperature of 1.0

  • Top_K = 0

  • Top_P = 1.0

  • Recommended minimum context: 16,384

  • Maximum context length window: 131,072 (128K context)

Chat template:

<|start_of_role|>user<|end_of_role|>Please list one IBM Research laboratory located in the United States. You should only output its name and location.<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>Almaden Research Center, San Jose, California<|end_of_text|>
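
If you are calling the model from Python instead of llama.cpp, the tokenizer's built-in chat template produces this same format. Below is a rough sketch using Hugging Face transformers with the suggested sampling settings; it assumes a recent transformers release with Granite-4.0 support, the accelerate package (for device_map), and enough memory for the 16-bit weights of the h-micro variant.

# Rough sketch: apply the Granite-4.0 chat template and sample with the
# suggested settings (temperature=1.0, top_p=1.0, top_k disabled).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/granite-4.0-h-micro"  # 16-bit upload; swap in another size if you prefer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Please list one IBM Research laboratory located in the United States."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=1.0,
    top_p=1.0,
    top_k=0,  # 0 disables top-k filtering in transformers
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))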

🦙 Ollama: Run Granite-4.0 Tutorial

  1. Install Ollama if you haven't already!

apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
  2. Run the model! Note that you can call ollama serve in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in the params file of our Hugging Face upload. You can change the model name 'granite-4.0-h-small-GGUF' to any other Granite model, e.g. 'granite-4.0-h-micro-GGUF:UD-Q8_K_XL'.

ollama run hf.co/unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL
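
Ollama also exposes a local REST API, so you can script the model you just pulled. The snippet below is a small sketch (not from an official notebook) that assumes ollama serve is running on its default port 11434 and that the requests package is installed.

# Sketch: query the pulled Granite GGUF via Ollama's local REST API.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json = {
        "model": "hf.co/unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL",
        "messages": [{"role": "user", "content": "Summarize IBM Granite-4.0 in one sentence."}],
        # Suggested sampling settings from above
        "options": {"temperature": 1.0, "top_p": 1.0, "top_k": 0},
        "stream": False,
    },
)
print(response.json()["message"]["content"])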

📖 llama.cpp: Run Granite-4.0 Tutorial

  1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
  2. If you want to use llama.cpp directly to load models, you can do the below. (:UD-Q4_K_XL) is the quantization type. You can also download the model via Hugging Face (point 3). This is similar to ollama run.

./llama.cpp/llama-cli \
    -hf unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL
  3. Or download the model via huggingface_hub (after running pip install huggingface_hub hf_transfer). The snippet below grabs the UD-Q4_K_XL quant; you can switch the pattern to other quantized versions like Q4_K_M, or to the full-precision BF16 upload.

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/granite-4.0-h-small-GGUF",
    local_dir = "unsloth/granite-4.0-h-small-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"], # Or e.g. "*Q4_K_M*" for Q4_K_M
)
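
Once the download finishes, the GGUF file sits inside the local_dir set above. A quick way to print its exact path before passing it to --model in the steps below (this just mirrors the snapshot_download call, nothing model-specific is assumed):

# List downloaded GGUF file(s) to pass to llama-cli's --model flag
import glob

for path in glob.glob("unsloth/granite-4.0-h-small-GGUF/**/*.gguf", recursive = True):
    print(path)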
  4. Run Unsloth's Flappy Bird test.

  5. Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length (Granite-4.0 supports up to 128K context), and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference.

  6. For conversation mode:

./llama.cpp/llama-cli \
    --model unsloth/granite-4.0-h-small-GGUF/granite-4.0-h-small-UD-Q4_K_XL.gguf \
    --threads 32 \
    --jinja \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 1.0 \
    --top-k 0 \
    --top-p 1.0

🐋 Docker: Run Granite-4.0 Tutorial

If you already have Docker Desktop, all you need to do is run the command below and you're done:

docker model pull hf.co/unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL
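
After pulling, you can chat with the model interactively or fire a one-shot prompt from a script. The sketch below shells out to docker model run, assuming your Docker Desktop version ships the Model Runner CLI (check docker model --help if unsure).

# Sketch: one-shot prompt via Docker Model Runner after the pull above.
# Assumes the `docker model run` subcommand is available in your Docker Desktop install.
import subprocess

result = subprocess.run(
    ["docker", "model", "run",
     "hf.co/unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL",
     "Give a one-line summary of IBM Granite-4.0."],
    capture_output = True, text = True, check = True,
)
print(result.stdout)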

🦥 Fine-tuning Granite-4.0 in Unsloth

Unsloth now supports all Granite-4.0 models, including Micro, Tiny and Small, for fine-tuning. Training is 2x faster, uses 50% less VRAM and supports 6x longer context lengths. Granite-4.0 Micro and Tiny fit comfortably in a 15GB VRAM T4 GPU.

This notebook trains a model to become a Support Agent that understands customer interactions, complete with analysis and recommendations. This setup allows you to train a bot that provides real-time assistance to support agents.

We also show you how to train a model using data stored in a Google Sheet.
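
One simple way to get Google Sheet data into a training run is to export the sheet as CSV and load it with pandas / datasets. This is only a sketch of that idea, not the exact code from the notebook; the sheet ID is a placeholder and the sheet must be shared as viewable by link.

# Sketch: load a Google Sheet (shared via link) as a Hugging Face dataset.
import pandas as pd
from datasets import Dataset

SHEET_ID = "YOUR_SHEET_ID"  # placeholder - replace with your own sheet
csv_url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

df = pd.read_csv(csv_url)          # each row = one support interaction
dataset = Dataset.from_pandas(df)  # format into your chat template before training
print(dataset)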

Unsloth config for Granite-4.0:

!pip install --upgrade unsloth
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/granite-4.0-h-micro",
    max_seq_length = 2048,   # Context length - can be longer, but uses more memory
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,    # A bit more accurate, uses 2x memory
    full_finetuning = False, # We have full finetuning now!
    # token = "hf_...",      # use one if using gated models
)
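
From here the usual Unsloth recipe applies: attach LoRA adapters, then train with TRL's SFTTrainer. The sketch below continues from the config above but is not lifted from the official notebook: the LoRA target modules are derived generically from the model (Granite's hybrid Mamba blocks don't use Llama-style gate/up/down names), the hyperparameters are illustrative defaults, and dataset is assumed to be a dataset with a formatted "text" column (e.g. built from the Google Sheet data above).

# Sketch: attach LoRA adapters and run a short SFT pass. Hyperparameters are
# illustrative defaults, not values from the official notebook.
import torch.nn as nn
from trl import SFTConfig, SFTTrainer

# Derive LoRA targets from the model itself (a generic heuristic), since the
# hybrid Mamba architecture uses different projection names than Llama-style models.
target_modules = sorted({
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear) and "lm_head" not in name
})

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                       # LoRA rank
    target_modules = target_modules,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # saves VRAM for longer contexts
    random_state = 3407,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,      # assumed: a dataset with a formatted "text" column
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        output_dir = "outputs",
    ),
)
trainer.train()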

If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:

pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
