gpt-oss: How to Run & Fine-tune
Run & fine-tune OpenAI's new open-source models!
OpenAI releases 'gpt-oss-120b' and 'gpt-oss-20b', two SOTA open language models under the Apache 2.0 license. Both models outperform similarly sized open models in reasoning, tool use, and few-shot tasks.
Trained with reinforcement learning (RL) and insights from advanced OpenAI models, gpt-oss-120b rivals o4-mini on reasoning and runs on a single 80GB GPU. gpt-oss-20b matches o3-mini on benchmarks and fits on 16GB of memory. Both excel at function calling and CoT reasoning, surpassing o1 and GPT-4o.
gpt-oss - Unsloth Dynamic 2.0 GGUFs:
🖥️ Running gpt-oss
Below are guides for the 20B and 120B variants of the model.
⚙️ Recommended Settings
OpenAI recommends these inference settings for both models:
Temperature = 0.6
Top_K = 0 (none)
Top_P = 1.0 (none)
Recommended minimum context: 16,384
Chat template:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-05\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there!<|end|><|start|>user<|message|>What is 1+1?<|end|><|start|>assistant
The end-of-sentence / end-of-generation (EOS) token is <|return|>
Run gpt-oss-20b
To achieve inference speeds of 6+ tokens per second for our Dynamic 4-bit quant, we recommend at least 14GB of unified memory (combined VRAM and RAM) or 14GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you’re using. GGUF Link: unsloth/gpt-oss-20b-GGUF
NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.
🦙 Ollama: gpt-oss-20b Tutorial
We are working with the Ollama team to support it. Stay tuned! You can run the model on LM Studio or llama.cpp below:
✨ Llama.cpp: Run gpt-oss-20b Tutorial
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
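If the build succeeded, the binaries are now in the llama.cpp folder (copied by the last command above). As a quick sanity check, assuming your build exposes the standard --version flag:
./llama.cpp/llama-cli --version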
You can directly pull from HuggingFace via:
./llama.cpp/llama-cli \
-hf unsloth/gpt-oss-20b-GGUF:F16 \
--jinja -ngl 99 --threads -1 --ctx-size 16384 \
--temp 0.6 --top-p 1.0 --top-k 0
Or, download the model via the code below (after installing pip install huggingface_hub hf_transfer).
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/gpt-oss-20b-GGUF",
local_dir = "unsloth/gpt-oss-20b-GGUF",
allow_patterns = ["*F16*"],
)
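Once the download finishes, point llama-cli at the local file. A minimal sketch, assuming the F16 GGUF inside the directory above is named gpt-oss-20b-F16.gguf (check the actual filename on disk):
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-20b-GGUF/gpt-oss-20b-F16.gguf \
--jinja -ngl 99 --threads -1 --ctx-size 16384 \
--temp 0.6 --top-p 1.0 --top-k 0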
Run gpt-oss-120b
To achieve inference speeds of 6+ tokens per second for our 1-bit quant, we recommend at least 66GB of unified memory (combined VRAM and RAM) or 66GB of system RAM alone. As a rule of thumb, your available memory should match or exceed the size of the model you’re using. GGUF Link: unsloth/gpt-oss-120b-GGUF
NOTE: The model can run on less memory than its total size, but this will slow down inference. Maximum memory is only needed for the fastest speeds.
📖 Llama.cpp: Run gpt-oss-120b Tutorial
For gpt-oss-120b, we will specifically use Llama.cpp for optimized inference.
If you want a full precision unquantized version, use our F16 versions!
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
You can directly use llama.cpp to download the model, but we normally suggest using huggingface_hub instead. To use llama.cpp directly, do:
./llama.cpp/llama-cli \
-hf unsloth/gpt-oss-120b-GGUF:F16 \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 0.6 \
--min-p 0.0 \
--top-p 1.0 \
--top-k 0
Or, download the model via the code below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL, or other quantized versions.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0" # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/gpt-oss-120b-GGUF",
local_dir = "unsloth/gpt-oss-120b-GGUF",
allow_patterns = ["*F16*"],
)
Run the model in conversation mode and try any prompt.
Edit --threads -1 for the number of CPU threads, --ctx-size 262144 for context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try adjusting it if your GPU goes out of memory. Also remove it if you have CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity. More options are discussed here.
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 0.6 \
--min-p 0.0 \
--top-p 1.0 \
--top-k 0
Also don't forget about the new Qwen3 update. Run Qwen3-235B-A22B-Instruct-2507 locally with llama.cpp.
🛠️ Improving generation speed
If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.
Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.
If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU", which offloads the up and down projection MoE layers.
Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only the up projection MoE layers.
You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload the gate, up and down MoE layers, but only from the 6th layer onwards, as in the sketch below.
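For example, you could swap the -ot pattern in the 120b command above for the layer-range regex. A minimal sketch, with the model path and other flags copied from the earlier invocation:
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads -1 --ctx-size 16384 --n-gpu-layers 99 \
-ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" \
--temp 0.6 --min-p 0.0 --top-p 1.0 --top-k 0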
The latest llama.cpp release also introduces high throughput mode. Use llama-parallel. Read more about it here. You can also quantize the KV cache to 4 bits, for example, to reduce VRAM / RAM movement, which can also make the generation process faster.
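A rough sketch of what a high-throughput run could look like. The -np flag (number of parallel sequences) is an assumption on our part, and llama-parallel is not among the targets in the build command above, so verify the options with ./llama.cpp/llama-parallel --help:
./llama.cpp/llama-parallel \
--model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads -1 --ctx-size 16384 --n-gpu-layers 99 \
-np 4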
📐How to fit long context (256K to 1M)
To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.
--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
You should use the _1 variants for somewhat increased accuracy, although they are slightly slower. For example, q4_1 or q5_1.
You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON, and use --flash-attn to enable it.
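Putting these pieces together, here is a rough sketch of a long-context run with a quantized KV cache. The q4_1 choice is just an example, and --cache-type-v is assumed to be the V-cache counterpart of --cache-type-k in your build:
./llama.cpp/llama-cli \
--model unsloth/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--threads -1 --ctx-size 262144 --n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--flash-attn \
--cache-type-k q4_1 --cache-type-v q4_1 \
--temp 0.6 --top-p 1.0 --top-k 0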
🦥 Fine-tuning gpt-oss with Unsloth
Unsloth is working hard to support the models! Stay tuned!
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo