gpt-oss: How to Run & Fine-tune
Run & fine-tune OpenAI's new open-source models!
OpenAI releases 'gpt-oss-120b' and 'gpt-oss-20b', two SOTA open language models under the Apache 2.0 license. Both 131k context models outperform similarly sized open models in reasoning, tool use, and few-shot tasks.
Trained with RL and insights from advanced OpenAI models, gpt-oss-120b rivals o4-mini and runs on a single 80GB GPU. gpt-oss-20b rivals o3-mini and fits on 16GB of memory. Both excel at function calling and CoT reasoning, surpassing o1 and GPT-4o.
Includes Unsloth's chat template fixes. For best results, use our quants!
gpt-oss - Unsloth Dynamic 2.0 GGUFs:
🖥️ Running gpt-oss
Below are guides for the 20B and 120B variants of the model.
⚙️ Recommended Settings
OpenAI recommends these inference settings for both models:
Temperature = 0.6
Top_P = 1.0 (disabled)
Top_K = 0 (disabled)
Recommended minimum context: 16,384
Maximum context length: 131,072
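If you serve the model behind an OpenAI-compatible endpoint (for example llama.cpp's llama-server), the settings map onto a request like the sketch below. The port, API key, model name and the top_k pass-through are assumptions about your local setup, not fixed values.
# Sketch: applying the recommended sampling settings through an
# OpenAI-compatible endpoint (e.g. a locally running llama-server).
# base_url, api_key and model name are placeholders for your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="gpt-oss-20b",  # ignored or matched by most local servers
    messages=[{"role": "user", "content": "What is 1+1?"}],
    temperature=0.6,
    top_p=1.0,
    extra_body={"top_k": 0},  # top_k is not a standard OpenAI field; most local servers accept it
)
print(response.choices[0].message.content)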
Chat template:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-05\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there!<|end|><|start|>user<|message|>What is 1+1?<|end|><|start|>assistant
The end-of-generation (EOS) token is <|return|>
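You normally don't have to construct this template by hand: the --jinja flag in the llama.cpp commands below applies the GGUF's embedded template, and in Python the tokenizer can render it for you. A minimal sketch, assuming the unsloth/gpt-oss-20b repo ships the fixed chat template:
# Sketch: render the gpt-oss chat template with transformers.
# Assumes the "unsloth/gpt-oss-20b" repo carries the (fixed) chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")

messages = [
    {"role": "user", "content": "What is 1+1?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends the assistant header so the model replies next
)
print(prompt)  # should end with <|start|>assistant, matching the template above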
Run gpt-oss-20b
To achieve inference speeds of 6+ tokens per second with our Dynamic 4-bit quant, we recommend at least 14GB of unified memory (combined VRAM and RAM) or 14GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you're using. GGUF Link: unsloth/gpt-oss-20b-GGUF
NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.
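If you're unsure which quant fits, a quick sketch like the one below lists every GGUF file in the repo with its size, so you can apply the rule of thumb above (the repo layout is assumed to stay as-is):
# Sketch: list available GGUF quants and their sizes so you can check them
# against your available RAM/VRAM before downloading.
from huggingface_hub import HfApi

api = HfApi()
info = api.model_info("unsloth/gpt-oss-20b-GGUF", files_metadata=True)
for f in info.siblings:
    if f.rfilename.endswith(".gguf") and f.size:
        print(f"{f.rfilename}: {f.size / 1e9:.1f} GB")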
For now, you can run the model with LM Studio or llama.cpp, as shown below:
✨ Llama.cpp: Run gpt-oss-20b Tutorial
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
You can directly pull the model from Hugging Face via:
./llama.cpp/llama-cli \
-hf unsloth/gpt-oss-20b-GGUF:F16 \
--jinja -ngl 99 --threads -1 --ctx-size 16384 \
--temp 0.6 --top-p 1.0 --top-k 0
Download the model via the following code (after running pip install huggingface_hub hf_transfer).
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/gpt-oss-20b-GGUF",
local_dir = "unsloth/gpt-oss-20b-GGUF",
allow_patterns = ["*F16*"],
)
Run gpt-oss-120b
To achieve inference speeds of 6+ tokens per second for our 1-bit quant, we recommend at least 66GB of unified memory (combined VRAM and RAM) or 66GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you’re using. GGUF Link: unsloth/gpt-oss-120b-GGUF
NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.
📖 Llama.cpp: Run gpt-oss-120b Tutorial
For gpt-oss-120b, we will specifically use Llama.cpp for optimized inference.
If you want a full-precision unquantized version, use our F16 versions!
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
You can directly use llama.cpp to download the model, but I normally suggest using huggingface_hub. To use llama.cpp directly, do:
./llama.cpp/llama-cli \
-hf unsloth/gpt-oss-120b-GGUF:F16 \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 0.6 \
--min-p 0.0 \
--top-p 1.0 \
--top-k 0
Or, download the model via the code below (after running pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL or other quantized versions.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/gpt-oss-120b-GGUF",
local_dir = "unsloth/gpt-oss-120b-GGUF",
allow_patterns = ["*F16*"], # change to e.g. ["*UD-Q2_K_XL*"] for a smaller quant
)
Run the model in conversation mode and try any prompt.
Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for a 256K context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try adjusting it if your GPU runs out of memory. Also remove it for CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity. More options are discussed here.
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 0.6 \
--min-p 0.0 \
--top-p 1.0 \
--top-k 0
🛠️ Improving generation speed
If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU", which offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
You can also customize the regex: for example, -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" offloads gate, up and down MoE layers, but only from the 6th layer onwards (see the sketch below for how this pattern matches tensor names).
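To make the regex behaviour concrete, here's a small sketch that checks which tensor names a pattern like the one above would send to the CPU. The tensor names are illustrative examples of llama.cpp's usual blk.<layer>.ffn_<proj>_exps naming; inspect your actual GGUF if in doubt.
# Sketch: check which tensors a llama.cpp -ot pattern would place on the CPU.
# The tensor names below are illustrative examples, not read from a real GGUF.
import re

pattern = r"\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps."

example_tensors = [
    "blk.0.ffn_up_exps.weight",    # layer 0  -> stays on GPU
    "blk.5.ffn_gate_exps.weight",  # layer 5  -> stays on GPU
    "blk.6.ffn_down_exps.weight",  # layer 6  -> offloaded to CPU
    "blk.23.ffn_up_exps.weight",   # layer 23 -> offloaded to CPU
    "blk.12.attn_q.weight",        # attention tensor -> never matched
]

for name in example_tensors:
    where = "CPU" if re.search(pattern, name) else "GPU"
    print(f"{name:32s} -> {where}")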
The latest llama.cpp release also introduces a high-throughput mode; use llama-parallel. Read more about it here. You can also quantize the KV cache (to 4 bits, for example) to reduce VRAM/RAM movement, which can also make the generation process faster.
📐 How to fit long context (256K to 1M)
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM/VRAM data movement. The allowed options for K quantization (the default is f16) include the below:
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants (for example q4_1, q5_1) for somewhat increased accuracy, albeit at slightly slower speed.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it.
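As a rough sanity check on how much memory a given context length and cache type needs, the sketch below estimates KV cache size from a model's hyperparameters. The layer/head numbers are placeholders rather than confirmed gpt-oss values, and the formula ignores details like sliding-window attention, so treat the result as an upper-bound estimate.
# Rough KV-cache size estimate: 2 caches (K and V) * layers * KV heads * head_dim
# * context length * bytes per element. Hyperparameters below are PLACEHOLDERS;
# substitute the real values from the model config or llama.cpp's load log.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

ctx = 131_072
# Approximate effective bytes per element for common cache types
# (quantized types include their per-block scales).
for name, bpe in [("f16", 2.0), ("q8_0", 1.0625), ("q4_0", 0.5625)]:
    size = kv_cache_bytes(n_layers=36, n_kv_heads=8, head_dim=64,
                          ctx_len=ctx, bytes_per_elem=bpe)
    print(f"{name}: ~{size / 1e9:.1f} GB of KV cache at {ctx:,} tokens")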
🦥 Fine-tuning gpt-oss with Unsloth
Unsloth is working hard to support the models! Stay tuned!
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
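Once support lands, fine-tuning should follow the usual Unsloth LoRA recipe. The sketch below shows that general pattern only; the model name, sequence length and LoRA settings are illustrative placeholders, not confirmed gpt-oss recommendations.
# Minimal sketch of the usual Unsloth LoRA workflow, ASSUMING gpt-oss support
# is available in your installed version. Names and hyperparameters are
# illustrative placeholders.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",  # placeholder repo id
    max_seq_length = 16384,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],  # adjust for the actual architecture
)
# From here, train with TRL's SFTTrainer as in other Unsloth notebooks.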