SGLang Deployment & Inference Guide

A guide on saving fine-tuned LLMs and deploying them with SGLang for production serving.

You can serve any LLM or fine-tuned model via SGLang for low-latency, high-throughput inference. SGLang supports text and image/video model inference on any GPU setup, and also supports some GGUF models.

Deploying gpt-oss-120b Tutorial

See below for standard SGLang setup and deployment instructions:

💻Setting up SGLang

For pip or uv installation on NVIDIA GPUs:

# OPTIONAL use a virtual environment
python -m venv unsloth_env
source unsloth_env/bin/activate

# Install Rust and outlines-core
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
sudo apt-get install -y pkg-config libssl-dev

# Install SGLang
pip install --upgrade pip
pip install uv
uv pip install "sglang" --prerelease=allow
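
To confirm the installation, you can print the installed version (a quick sanity check; the exact version depends on when you install):

python -c "import sglang; print(sglang.__version__)"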

For Docker setups, run:

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path unsloth/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
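
Once the container is running, you can check that the server is reachable from the host. This assumes the OpenAI-compatible endpoints are exposed on port 30000 as in the command above:

# List the served models via the OpenAI-compatible endpoint
curl http://localhost:30000/v1/models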

🚚Deploying SGLang models

After saving your fine-tuned model, you can launch an SGLang server with:

python3 -m sglang.launch_server \
    --model-path unsloth/Llama-3.2-1B-Instruct \
    --host 0.0.0.0 --port 30000

You can then call the model with the OpenAI Python client (in another terminal or using tmux):

# Install openai via pip install openai
from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://0.0.0.0:30000/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/Llama-3.2-1B-Instruct",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)

And you will get 2 + 2 = 4.
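
If you prefer not to use the Python client, the same request can be made with a plain HTTP call to the OpenAI-compatible endpoint (adjust the host, port and model name to your setup):

curl http://0.0.0.0:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "unsloth/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "What is 2+2?"}]
    }'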

🦥Unsloth SGLang Instructions

After fine-tuning (see our Fine-tuning LLMs Guide) or running one of our Unsloth Notebooks, you can save or deploy your models directly through SGLang within a single workflow.

To save to 16-bit for SGLang, use:

model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

To save just the LoRA adapters, either use:

model.save_pretrained("model")
tokenizer.save_pretrained("tokenizer")

Or use our built-in function to do the same:

model.save_pretrained_merged("model", tokenizer, save_method = "lora")
model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

🚃gpt-oss-20b: Unsloth & SGLang Deployment Guide

Below is a step-by-step tutorial for training gpt-oss-20b with Unsloth and deploying it via SGLang, including performance benchmarks across multiple quantization formats.

1. Unsloth Fine-tuning and Exporting Formats

If you're new to fine-tuning, you can read our guide or try the gpt-oss 20B fine-tuning notebook at gpt-oss: How to Run & Fine-tune. After training, you can export the model in multiple formats:

# 1. 16-bit merged model (full precision)
model.save_pretrained_merged(
    "finetuned_model", 
    tokenizer, 
    save_method = "merged_16bit",
)

# 2. 4-bit MXFP4 model (ONLY for GPT-OSS)
model.save_pretrained_merged(
    "finetuned_model", 
    tokenizer, 
    save_method = "mxfp4", # (ONLY FOR GPT-OSS otherwise choose merged_16bit)
)

2. Deployment with SGLang

python -m sglang.launch_server \
    --model-path finetuned_model \
    --host 0.0.0.0 --port 30002

3. Use OpenAI Completions

from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://0.0.0.0:30002/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "finetuned_model",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)

## OUTPUT ##
# <|channel|>analysis<|message|>The user asks a simple math question. We should answer 4. Also we should comply with policy. No issues.<|end|><|start|>assistant<|channel|>final<|message|>2 + 2 equals 4.

⚡Benchmarking SGLang

Below are the commands to benchmark the inference speed of your fine-tuned model. First, launch the server:

python -m sglang.launch_server \
    --model-path finetuned_model \
    --host 0.0.0.0 --port 30002

Then in another terminal or via tmux:

# Batch Size=8, Input=1024, Output=1024
python -m sglang.bench_one_batch_server \
    --model finetuned_model \
    --base-url http://0.0.0.0:30002 \
    --batch-size 8 \
    --input-len 1024 \
    --output-len 1024

You will see benchmarking output similar to the results below. On a single B200 GPU with gpt-oss-20b, we measured roughly 2,500 tokens/s of output throughput:

| Batch/Input/Output | TTFT (s) | ITL (s) | Input Throughput (tokens/s) | Output Throughput (tokens/s) |
| --- | --- | --- | --- | --- |
| 8/1024/1024 | 0.40 | 3.59 | 20,718.95 | 2,562.87 |
| 8/8192/1024 | 0.42 | 3.74 | 154,459.01 | 2,473.84 |

See https://docs.sglang.ai/advanced_features/server_arguments.html for the full list of SGLang server arguments.
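
For example, tensor parallelism across multiple GPUs and the KV-cache memory budget are controlled via server arguments. A sketch with illustrative values (check the link above for the exact flags supported by your SGLang version):

# Shard the model across 2 GPUs and use 85% of GPU memory for weights + KV cache
python -m sglang.launch_server \
    --model-path finetuned_model \
    --tp 2 \
    --mem-fraction-static 0.85 \
    --host 0.0.0.0 --port 30002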

🏃SGLang Interactive Offline Mode

You can also use SGLang in offline mode (i.e. without running a server) inside a Python interactive environment.

import sglang as sgl
engine = sgl.Engine(model_path = "unsloth/Qwen3-0.6B", random_seed = 42)

prompt = "Today is a sunny day and I like"
sampling_params = {"temperature": 0, "max_new_tokens": 256}
outputs = engine.generate(prompt, sampling_params)["text"]
print(outputs)
engine.shutdown()
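
The offline engine also accepts a list of prompts, which is handy for quick batched generation. A minimal sketch reusing the same model (the exact output format may vary slightly between SGLang versions):

import sglang as sgl
engine = sgl.Engine(model_path = "unsloth/Qwen3-0.6B", random_seed = 42)

prompts = [
    "Today is a sunny day and I like",
    "The capital of France is",
]
sampling_params = {"temperature": 0, "max_new_tokens": 64}
# generate returns one result dict per prompt when given a list
outputs = engine.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
engine.shutdown()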

🎇GGUFs in SGLang

SGLang also supports GGUFs. Qwen3 MoE support is still under construction, but most dense models (Llama 3, Qwen 3, Mistral, etc.) are supported.

First, install the latest gguf Python package:

pip install -e "git+https://github.com/ggml-org/llama.cpp.git#egg=gguf&subdirectory=gguf-py" # install a python package from a repo subdirectory

Then, for example in offline mode, run:

from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    "unsloth/Qwen3-32B-GGUF",
    filename = "Qwen3-32B-UD-Q4_K_XL.gguf",
)
import sglang as sgl
engine = sgl.Engine(model_path = model_path, random_seed = 42)

prompt = "Today is a sunny day and I like"
sampling_params = {"temperature": 0, "max_new_tokens": 256}
outputs = engine.generate(prompt, sampling_params)["text"]
print(outputs)
engine.shutdown()
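
The same GGUF file should also work in server mode, since the server shares the offline engine's model loader. A sketch, assuming you first download the .gguf file locally (filenames match the example above):

# Download the GGUF file, then point --model-path at it
huggingface-cli download unsloth/Qwen3-32B-GGUF Qwen3-32B-UD-Q4_K_XL.gguf --local-dir .
python -m sglang.launch_server \
    --model-path Qwen3-32B-UD-Q4_K_XL.gguf \
    --host 0.0.0.0 --port 30000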
