IBM Granite 4.0
How to run IBM Granite-4.0 with Unsloth GGUFs on llama.cpp and Ollama, and how to fine-tune the models!
IBM has released Granite-4.0 models in 3 sizes: Micro (3B), Tiny (7B total/1B active) and Small (32B total/9B active). Trained on 15T tokens, IBM’s new Hybrid (H) Mamba architecture enables Granite-4.0 models to run faster with lower memory use.
Learn how to run Unsloth Granite-4.0 Dynamic GGUFs or fine-tune/RL the model. You can fine-tune Granite-4.0 with our free Colab notebook for a support agent use-case.
Unsloth Granite-4.0 uploads:
You can also view our Granite-4.0 collection for all uploads including Dynamic Float8 quants etc.
Granite-4.0 models explained:
H-Small (MoE): Enterprise workhorse for daily tasks, supports multiple long-context sessions on entry GPUs like L40S (32B total, 9B active).
H-Tiny (MoE): Fast, cost-efficient for high-volume, low-complexity tasks; optimized for local and edge use (7B total, 1B active).
H-Micro (Dense): Lightweight, efficient for high-volume, low-complexity workloads; ideal for local and edge deployment (3B total).
Micro (Dense): Alternative dense option when Mamba2 isn’t fully supported (3B total).
Run Granite-4.0 Tutorials
⚙️ Recommended Inference Settings
IBM only recommends a few settings (such as context length), so we'll use standard sampling settings:
Temperature = 1.0
Top_K = 0
Top_P = 1.0
Recommended minimum context length: 16,384
Maximum context window: 131,072 tokens (128K context)
Chat template:
<|start_of_role|>user<|end_of_role|>Please list one IBM Research laboratory located in the United States. You should only output its name and location.<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>Almaden Research Center, San Jose, California<|end_of_text|>
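If you load the model with Hugging Face transformers instead of a GGUF runtime, the tokenizer's apply_chat_template builds this exact format for you. Below is a minimal sketch, assuming a recent transformers release with Granite-4.0 support, the accelerate package, and the unsloth/granite-4.0-h-micro repo (the same one used in the fine-tuning section); the prompt and sampling settings mirror the recommendations above.

# Minimal sketch: apply Granite-4.0's chat template and generate with the recommended settings.
# Assumes `pip install transformers accelerate torch` and a transformers version that supports Granite-4.0.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/granite-4.0-h-micro"  # any Granite-4.0 repo works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Please list one IBM Research laboratory located in the United States."}]
# apply_chat_template inserts the <|start_of_role|>...<|end_of_role|> markers shown above
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=1.0,  # recommended settings from above
    top_p=1.0,
    top_k=0,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))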
🦙 Ollama: Run Granite-4.0 Tutorial
Install ollama if you haven't already!
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in the params file of our Hugging Face upload! You can change the model name 'granite-4.0-h-small-GGUF' to any Granite model like 'granite-4.0-h-micro:Q8_K_XL'.
ollama run hf.co/unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL
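Once the model is pulled, you can also query it programmatically through Ollama's local REST API (it listens on port 11434 by default). The sketch below uses the requests library and passes the recommended sampling settings via options; the model name mirrors the ollama run command above.

# Minimal sketch: chat with the pulled Granite-4.0 GGUF via Ollama's REST API.
# Assumes `ollama serve` is running locally (default port 11434) and `pip install requests`.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hf.co/unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL",
        "messages": [
            {"role": "user", "content": "Please list one IBM Research laboratory located in the United States."}
        ],
        "stream": False,
        "options": {"temperature": 1.0, "top_p": 1.0, "top_k": 0},  # recommended settings
    },
    timeout=600,
)
print(response.json()["message"]["content"])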
📖 llama.cpp: Run Granite-4.0 Tutorial
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
If you want to use llama.cpp directly to load models, you can do the below. The part after the colon (:UD-Q4_K_XL) is the quantization type. You can also download the model via Hugging Face (point 3). This is similar to ollama run.
./llama.cpp/llama-cli \
-hf unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL
OR download the model via the script below (after installing the dependencies with pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL (used below), Q4_K_M, or other quantized versions (like BF16 full precision).
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/granite-4.0-h-small-GGUF",
local_dir = "unsloth/granite-4.0-h-small-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"], # Download only the UD-Q4_K_XL quant
)
Run Unsloth's Flappy Bird test
Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for context length (Granite-4.0 supports 128K context length!), and --n-gpu-layers 99 to set how many layers are offloaded to the GPU. Try lowering it if your GPU runs out of memory, and remove it entirely for CPU-only inference. For conversation mode:
./llama.cpp/llama-cli \
--model unsloth/granite-4.0-h-small-GGUF/granite-4.0-h-small-UD-Q4_K_XL.gguf \
--threads 32 \
--jinja \
--ctx-size 16384 \
--n-gpu-layers 99 \
--seed 3407 \
--prio 2 \
--temp 1.0 \
--top-k 0 \
--top-p 1.0
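If you'd rather serve the model over HTTP instead of the interactive CLI, llama.cpp also ships llama-server, which exposes an OpenAI-compatible endpoint (add llama-server to the --target list in the build step above and launch it with the same --model, --ctx-size and --n-gpu-layers flags). Below is a minimal client sketch, assuming the server is running on its default port 8080; the Flappy Bird prompt is just an example request.

# Minimal sketch: query a local llama-server (OpenAI-compatible API) running the Granite-4.0 GGUF.
# Assumes llama-server was built and started, e.g.:
#   ./llama.cpp/llama-server --model unsloth/granite-4.0-h-small-GGUF/granite-4.0-h-small-UD-Q4_K_XL.gguf \
#       --ctx-size 16384 --n-gpu-layers 99 --jinja
# and that `pip install requests` is available.
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "granite-4.0-h-small",  # informational; the server uses whichever model it loaded
        "messages": [{"role": "user", "content": "Create a Flappy Bird game in Python."}],
        "temperature": 1.0,  # recommended sampling settings
        "top_p": 1.0,
        "max_tokens": 2048,
    },
    timeout=600,
)
print(response.json()["choices"][0]["message"]["content"])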
🐋 Docker: Run Granite-4.0 Tutorial
If you already have Docker Desktop, all you need to do is run the command below and you're done:
docker model pull hf.co/unsloth/granite-4.0-h-small-GGUF:UD-Q4_K_XL
🦥 Fine-tuning Granite-4.0 in Unsloth
Unsloth now supports all Granite-4.0 models including Micro, Tiny and Small for fine-tuning. Training is 2x faster, uses 50% less VRAM and supports 6x longer context lengths. Granite-4.0 Micro and Tiny fit comfortably on a 15GB VRAM T4 GPU.
Granite-4.0 free fine-tuning notebook
This notebook trains a model to become a Support Agent that understands customer interactions, complete with analysis and recommendations. This setup allows you to train a bot that provides real-time assistance to support agents.
We also show you how to train a model using data stored in a Google Sheet.

Unsloth config for Granite-4.0:
!pip install --upgrade unsloth
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/granite-4.0-h-micro",
max_seq_length = 2048, # Context length - can be longer, but uses more memory
load_in_4bit = True, # 4bit uses much less memory
load_in_8bit = False, # A bit more accurate, uses 2x memory
full_finetuning = False, # We have full finetuning now!
# token = "hf_...", # use one if using gated models
)
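After loading, you would typically attach LoRA adapters before training. The call below is the standard FastLanguageModel.get_peft_model recipe from Unsloth's notebooks; the hyperparameters (rank, alpha, target modules) are illustrative defaults rather than Granite-specific recommendations, so check the free notebook above for the exact values used there.

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,             # LoRA rank - larger = more capacity, more VRAM
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    # Standard projection names from Unsloth notebooks; illustrative, not Granite-specific -
    # verify they match the module names in Granite-4.0's hybrid architecture.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",  # saves VRAM for longer contexts
    random_state = 3407,
)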
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo